Disk replacement via hot swapping with variable parity

ABSTRACT

A system and method for enhanced disk replacement via hot swapping with variable parity is described. The system and method operate on a computer storage system that includes a plurality of disk drives for storing distributed parity groups. Each distributed parity group includes storage blocks. The storage blocks include one or more data blocks and a parity block that is associated with the one or more data blocks. Each of the storage blocks is stored on a separate disk drive such that no two storage blocks from a given parity set reside on the same disk drive. The computer storage system further includes file system metadata to describe a location of each of the storage blocks. The computer storage system further includes a resource-allocation module to recognize a new disk drive that is hot-swapped into the plurality of disk drives during file system operation and to use the new disk drive to store one or more storage blocks.

REFERENCE TO RELATED APPLICATIONS

[0001] The present application claims priority benefit under 35 U.S.C. § 119(e) from all of the following U.S. Provisional Applications, the contents of which are hereby incorporated by reference in their entirety:

[0002] U.S. Provisional Application No. 60/264,671, filed Jan. 29, 2001, titled “DYNAMICALLY DISTRIBUTED FILE SYSTEM”;

[0003] U.S. Provisional Application No. 60/264,694, filed Jan. 29, 2001, titled “A DATA PATH ACCELERATOR ASIC FOR HIGH PERFORMANCE STORAGE SYSTEMS”;

[0004] U.S. Provisional Application No. 60/264,672, filed Jan. 29, 2001, titled “INTEGRATED FILE SYSTEM/PARITY DATA PROTECTION”;

[0005] U.S. Provisional Application No. 60/264,673, filed Jan. 29, 2001, titled “DISTRIBUTED PARITY DATA PROTECTION”;

[0006] U.S. Provisional Application No. 60/264,670, filed Jan. 29, 2001, titled “AUTOMATIC IDENTIFICATION AND UTILIZATION OF RESOURCES IN A DISTRIBUTED FILE SERVER”;

[0007] U.S. Provisional Application No. 60/264,669, filed Jan. 29, 2001, titled “DATA FLOW CONTROLLER ARCHITECTURE FOR HIGH PERFORMANCE STORAGE SYSTEMS”;

[0008] U.S. Provisional Application No. 60/264,668, filed Jan. 29, 2001, titled “ADAPTIVE LOAD BALANCING FOR A DISTRIBUTED FILE SERVER”; and

[0009] U.S. Provisional Application No. 60/302,424, filed Jun. 29, 2001, titled “DYNAMICALLY DISTRIBUTED FILE SYSTEM”.

BACKGROUND OF THE INVENTION

[0010] 1. Field of the Invention

[0011] This invention relates to the field of data storage and management. More particularly, this invention relates to high-performance mass storage systems and methods for data storage, backup, and recovery.

[0012] 2. Description of the Related Art

[0013] In modern computer systems, collections of data are usually organized and stored as files. A file system allows users to organize, access, and manipulate these files and also performs administrative tasks such as communicating with physical storage components and recovering from failure. The demand for file systems that provide high-speed, reliable, concurrent access to vast amounts of data for large numbers of users has been steadily increasing in recent years. Often such systems use a Redundant Array of Independent Disks (RAID) technology, which distributes the data across multiple disk drives, but provides an interface that appears to users as one, unified disk drive system, identified by a single drive letter. In a RAID system that includes more than one array of disks, each array is often identified by a unique drive letter, and in order to access a given file, a user must correctly identify the drive letter for the disk array on which the file resides. Any transfer of files from one disk array to another and any addition of new disk arrays to the system must be made known to users so that they can continue to correctly access the files.

[0014] RAID systems effectively speed up access to data over single-disk systems, and they allow for the regeneration of data lost due to a disk failure. However, they do so by rigidly prescribing the configuration of system hardware and the block size and location of data stored on the disks. Demands for increases in storage capacity that are transparent to the users or for hardware upgrades that lack conformity with existing system hardware cannot be accommodated, especially while the system is in use. In addition, such systems commonly suffer from the problem of data fragmentation, and they lack the flexibility necessary to intelligently optimize use of their storage resources.

[0015] RAID systems are designed to provide high-capacity data storage with built-in reliability mechanisms able to automatically reconstruct and restore saved data in the event of a hardware failure or data corruption. In conventional RAID technology, techniques including spanning, mirroring, and duplexing are used to create a data storage device from a plurality of smaller single disk drives with improved reliability and storage capacity over conventional disk systems. RAID systems generally incorporate a degree of redundancy into the storage mechanism to permit saved data to be reconstructed in the event of single (or sometimes double) disk failure within the disk array. Saved data is further stored in a predefined manner that is dependent on a fixed algorithm to distribute the information across the drives of the array. The manner of data distribution and data redundancy within the disk array impacts the performance and usability of the storage system and may result in substantial tradeoffs between performance, reliability, and flexibility.

[0016] A number of RAID configurations have been proposed to map data across the disks of the disk array. Some of the more commonly recognized configurations include RAID-1, RAID-2, RAID-3, RAID-4, and RAID-5.

[0017] In most RAID systems, data is sequentially stored in data stripes and a parity block is created for each data stripe. The parity block contains information derived from the sequence and composition of the data stored in the associated data stripe. RAID arrays can reconstruct information stored in a particular data stripe using the parity information; however, this configuration imposes the requirement that records span across all drives in the array, resulting in a small stripe size relative to the stored record size.

[0018] FIG. 21 illustrates the data mapping approach used in many conventional RAID storage device implementations. Although the diagram corresponds most closely to RAID-3 or RAID-4 mapping schemas, other RAID configurations are organized in a similar manner. As previously indicated, each RAID configuration uses a striped disk array 2110 that logically combines two or more disk drives 2115 into a single storage unit. The storage space of each drive 2115 is organized by partitioning the space on the drives into stripes 2120 that are interleaved so that the available storage space is distributed evenly across each drive.

[0019] Information or files are stored on the disk array 2110. Typically, the writing of data to the disks occurs in a parallel manner to improve performance. A parity block is constructed by performing a logical operation (exclusive OR) on the corresponding blocks of the data stripe to create a new block of data representative of the result of the logical operation. The result is termed a parity block and is written to a separate area 2130 within the disk array. In the event of data corruption within a particular disk of the array 2110, the parity block is used in conjunction with the remaining non-corrupted data blocks to reconstruct the data.
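The exclusive-OR parity construction and the reconstruction of a corrupted block described above can be illustrated with a minimal sketch. The function names `xor_blocks` and `reconstruct_block` and the sample block contents are assumptions chosen for illustration; they do not appear in the specification.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise exclusive OR of a list of equal-length blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def reconstruct_block(surviving_blocks, parity_block):
    """Rebuild one missing data block from the survivors and the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


# Hypothetical stripe of three data blocks (illustrative contents only).
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(stripe)

# Pretend the second block was lost and rebuild it from the rest.
recovered = reconstruct_block([stripe[0], stripe[2]], parity)
```

Because exclusive OR is its own inverse, combining the parity block with every surviving data block yields the missing block, which is why a single parity block can cover only a single lost block per stripe.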

[0020] In the RAID architecture, multiple disks are typically mapped to a single ‘virtual disk’. Consecutive blocks of the virtual disk are mapped by a strictly defined algorithm to a set of physical disks with no file-level awareness. When the RAID system is used to host a conventional file system, it is the file system that maps files to the virtual disk blocks, where they may be mapped in a sequential or non-sequential order in a RAID stripe. The RAID stripe may contain data from a single file or data from multiple files if the files are small or the file system is highly fragmented.

[0021] The aforementioned RAID architecture suffers from a number of drawbacks that limit its flexibility and scalability for use in reliable storage systems. One problem with existing RAID systems is that the data striping is designed to be used in conjunction with disks of the same size. Each stripe occupies a fixed amount of disk space and the total number of stripes allowed in the RAID system is limited by the capacity of the smallest disk in the array. Any additional space that may be present on drives having a capacity larger than the smallest drive goes unused as the RAID system lacks the ability to use the additional space. This further presents a problem in upgrading the storage capacity of the RAID system, as all of the drives in the array must be replaced with larger capacity drives if additional storage space is desired. Therefore, existing RAID systems are inflexible in terms of their drive composition, increasing the cost and inconvenience to maintain and upgrade the storage system.
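The smallest-disk limitation described above can be made concrete with a short arithmetic sketch. The function name `conventional_stripe_usage` and the drive capacities are hypothetical values chosen for illustration.

```python
def conventional_stripe_usage(disk_capacities_gb):
    """Return (striped, unused) capacity when every drive is limited to the
    size of the smallest drive, as in a conventional striped array."""
    smallest = min(disk_capacities_gb)
    striped = smallest * len(disk_capacities_gb)
    unused = sum(disk_capacities_gb) - striped
    return striped, unused


# Hypothetical mixed array: two 36 GB drives and two 73 GB drives.
striped_gb, unused_gb = conventional_stripe_usage([36, 36, 73, 73])
```

In this assumed configuration only 144 GB of the 218 GB installed is covered by stripes, and 74 GB on the larger drives cannot be used.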

[0022] A further problem with conventional RAID arrays resides in the rigid organization of data on the disks of the RAID array. As previously described, this organization typically does not use available disk space in an efficient manner. These systems further utilize a single fixed block size to store data which is implemented with the restriction of sequential file storage along each disk stripe. Data storage in this manner is typically inefficient as regions or gaps of disk space may go unused due to the file organization restrictions. Furthermore, the fixed block size of the RAID array is not able to distinguish between large files, which benefit from larger block size, and smaller files, which benefit from smaller block size for more efficient storage and reduced wasted space.

[0023] Although conventional RAID configurations are characterized as being fault-tolerant, this capability is typically limited to single disk failures. Should more than one (or two) disks fail or become inoperable within the RAID array before they can be replaced or repaired, there is the potential for data loss. This problem again arises from the rigid structure of data storage within the array that utilizes sequential data striping. This problem is further exacerbated by the lack of ability of the RAID system to flexibly redistribute data to other disk areas to compensate for drive faults. Thus, when one drive becomes inoperable within the array, the likelihood of data loss increases significantly until the drive is replaced, resulting in increased maintenance and monitoring requirements when using conventional RAID systems.

[0024] With respect to conventional data storage systems or other computer networks, conventional load balancing includes a variety of drawbacks. For example, decisions relating to load balancing are typically centralized in one governing process, one or more system administrators, or combinations thereof. Accordingly, such systems have a single point of failure, such as the governing process or the system administrator. Moreover, load balancing occurs only when the centralized process or system administrator can organize performance data, make a decision, and then transmit that decision throughout the data storage system or computer network. This often means that such load balancing can be slow to react, difficult to optimize for a particular server, and difficult to scale as the available resources expand or contract. In addition, conventional load balancing typically is limited to balancing processing and communications activity between servers only.

SUMMARY OF THE INVENTION

[0025] The present invention solves these and other problems by providing a dynamically distributed file system that accommodates current demands for high capacity, throughput, and reliability, while presenting to the users a single-file-system interface that appears to include every file in the system on a single server or drive. In this way, the file system is free to flexibly, transparently, and on-the-fly distribute and augment physical storage of the files in any manner that suits its needs, across disk drives, and across servers, and users can freely access any file without having specific knowledge of the file's current physical location.

[0026] One embodiment includes a storage device and architecture which possesses features such as transparent scalability where disks of non-identical capacity can be fully utilized without the “dead-space” restrictions associated with conventional disk arrays. In one embodiment, a flexible storage space allocation system handles storing large and small file types to improve disk space utilization. In another embodiment, an improved method for maintaining data integrity overcomes the single drive (or double) fault limitation of conventional systems in order to increase storage reliability while at the same time reducing maintenance and monitoring requirements.

[0027] In one embodiment, distributed parity groups (DPG) are integrated into the distributed file storage system technology. This architecture provides capabilities for optimizing the use of disk resources by moving frequently and infrequently accessed data blocks between drives so as to maximize the throughput and capacity utilization of each drive.

[0028] In one embodiment, the architecture supports incorporation of new disk drives without significant reconfiguration or modification of the existing distributed file storage system to provide improved reliability, flexibility, and scalability. Additionally, the architecture permits the removal of arbitrary disk drives from the distributed file storage system and automatically redistributes the contents of these drives to other available drives as necessary.

[0029] The distributed file storage system can proactively position objects for initial load balancing, such as, for example, to determine where to place a particular new object. Additionally, the distributed file storage system can continue to proactively position objects, thereby accomplishing active load balancing for the existing objects throughout the system. According to one embodiment, one or more filters may be applied during initial and/or active load balancing to ensure one or a small set of objects are not frequently transferred, or churned, throughout the resources of the system.

[0030] As used herein, load balancing can include, among other things, capacity balancing, throughput balancing, or both. Capacity balancing seeks balance in storage, such as the number of objects, the number of megabytes, or the like, stored on particular resources within the distributed file storage system. Throughput balancing seeks balance in the number of transactions processed, such as the number of transactions per second, the number of megabytes per second, or the like, handled by particular resources within the distributed file storage system. According to one embodiment, the distributed file storage system can position objects to balance capacity, throughput, or both, between objects on a resource, between resources, between the servers of a cluster of resources, between the servers of other clusters of resources, or the like.

[0031] The distributed file storage system can comprise resources, such as servers or clusters, which can seek to balance the loading across the system by reviewing a collection of load balancing data from itself, one or more of the other servers in the system, or the like. The load balancing data can include object file statistics, server profiles, predicted file accesses, or the like. A proactive object positioner associated with a particular server can use the load balancing data to generate an object positioning plan designed to move objects, replicate objects, or both, across other resources in the system. Then, using the object positioning plan, the resource or other resources within the distributed file storage system can execute the plan in an efficient manner.

[0032] According to one embodiment, each server pushes objects defined by that server's respective portion of the object positioning plan to the other servers in the distributed file storage system. By employing the servers to individually push objects based on the results of their object positioning plan, the distributed file storage system provides a server-, process-, and administrator-independent approach to object positioning, and thus load balancing, within the distributed file storage system.

[0033] In one embodiment, the network file storage system includes a first file server operably connected to a network fabric; a second file server operably connected to the network fabric; first file system information loaded on the first file server; and second file system information loaded on the second file server, the first file system information and the second file system information configured to allow a client computer operably connected to the network fabric to locate files stored by the first file server and files stored by the second file server without prior knowledge as to which file server stores the files. In one embodiment, the first file system information includes directory information that describes a directory structure of a portion of the network file system whose directories are stored on the first file server, the directory information includes location information for a first file, the location information includes a server id that identifies at least the first file server or the second file server.

[0034] In one embodiment, the network file storage system loads first file system metadata on a first file server operably connected to a network fabric; loads second file system metadata on a second file server connected to the network fabric, the first file system metadata and the second file system metadata include information to allow a client computer operably connected to the network fabric to locate a file stored by the first file server or stored by the second file server without prior knowledge as to which file server stores the file.

[0035] In one embodiment, the network file storage system performs a file handle lookup on a computer network file system by: sending a root-directory lookup request to a first file server operably connected to a network fabric; receiving a first lookup response from the first file server, the first lookup response includes a server id of a second file server connected to the network fabric; sending a directory lookup request to the second file server; and receiving a file handle from the second file server.
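The two-step file handle lookup described above can be sketched with in-memory stand-ins for the file servers. The class `FileServer`, its fields, the function `lookup_file_handle`, and the sample directory, file name, and handle string are all hypothetical; the specification does not define these interfaces, and a real system would exchange the requests over the network fabric.

```python
class FileServer:
    """In-memory stand-in for a file server (illustrative only)."""

    def __init__(self, server_id):
        self.server_id = server_id
        self.root_entries = {}   # directory name -> server id that owns it
        self.directories = {}    # directory name -> {file name: file handle}

    def lookup_root(self, directory):
        """Answer a root-directory lookup with the owning server's id."""
        return self.root_entries[directory]

    def lookup_directory(self, directory, filename):
        """Answer a directory lookup with the file handle."""
        return self.directories[directory][filename]


def lookup_file_handle(servers, first_server_id, directory, filename):
    # Step 1: send the root-directory lookup to the first file server.
    first = servers[first_server_id]
    # Step 2: the response carries the server id of the second file server.
    second_id = first.lookup_root(directory)
    # Step 3: send the directory lookup to the second file server.
    second = servers[second_id]
    # Step 4: receive the file handle from the second file server.
    return second.lookup_directory(directory, filename)


# Hypothetical two-server layout.
servers = {1: FileServer(1), 2: FileServer(2)}
servers[1].root_entries["docs"] = 2
servers[2].directories["docs"] = {"a.txt": "handle-2-17"}
handle = lookup_file_handle(servers, 1, "docs", "a.txt")
```

The client needs to know only the first file server; the location of the file is learned from the server id in the first response.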

[0036] In one embodiment, the network file storage system allocates space by: receiving a file allocation request in a first file server, the first file server owning a parent directory that is to contain a new file, the file allocation request includes a file handle of the parent directory; determining a selected file server from a plurality of file servers; sending a file allocation request from the first server to the selected server; creating metadata entries for the new file in file system data managed by the selected file server; generating a file handle for the new file; sending the file handle to the first file server; and creating a directory entry for the new file in the parent directory.
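The allocation sequence described above can be sketched as follows. The class `Server`, the function `allocate_file`, the tuple form of the file handle, and in particular the selection policy (choose the server with the most free blocks) are assumptions made for illustration; the specification states only that a selected file server is determined from a plurality of file servers.

```python
import itertools


class Server:
    """In-memory stand-in for a file server (illustrative only)."""

    def __init__(self, server_id, free_blocks):
        self.server_id = server_id
        self.free_blocks = free_blocks
        self.metadata = {}      # file handle -> metadata entry
        self.directories = {}   # parent directory handle -> {name: file handle}
        self._ids = itertools.count(1)

    def create_metadata(self, name):
        """Create metadata entries for a new file and generate its handle."""
        handle = (self.server_id, next(self._ids))
        self.metadata[handle] = {"name": name, "size": 0}
        return handle


def allocate_file(first, servers, parent_handle, name):
    # Determine the selected server. ASSUMPTION: pick the most free space.
    selected = max(servers, key=lambda s: s.free_blocks)
    # The selected server creates the metadata and generates the file handle.
    handle = selected.create_metadata(name)
    # The handle returns to the first server, which owns the parent directory
    # and creates the directory entry for the new file.
    first.directories.setdefault(parent_handle, {})[name] = handle
    return handle


s1 = Server(1, free_blocks=10)
s2 = Server(2, free_blocks=500)
new_handle = allocate_file(s1, [s1, s2], parent_handle=(1, 0), name="report.txt")
```

In this sketch the directory entry lives on the first server while the file metadata lives on the selected server, so the two can differ.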

[0037] In one embodiment, the network file storage system includes: a first file server operably connected to a network fabric; a second file server operably connected to the network fabric; first file system information loaded on the first file server; and second file system information loaded on the second file server, the first file system information and the second file system information configured to allow a client computer operably connected to the network fabric to locate files owned by the first file server and files owned by the second file server without prior knowledge as to which file server owns the files, the first file server configured to mirror at least a portion of the files owned by the second file server, the first file server configured to store information sufficient to regenerate the second file system information, and the second file server configured to store information sufficient to regenerate the first file system information.

[0038] In one embodiment, the network file storage system: loads first file system metadata on a first file server operably connected to a network fabric; loads second file system metadata on a second file server connected to the network fabric, the first file system metadata and the second file system metadata include information to allow a client computer operably connected to the network fabric to locate a file stored by the first file server or stored by the second file server without prior knowledge as to which file server stores the file; maintains information on the second file server to enable the second file server to reconstruct an information content of the first file system metadata; and maintains information on the first file server to enable the first file server to reconstruct an information content of the second file system metadata.

[0039] In one embodiment, the computer network file storage system is fault-tolerant and includes: a first file server operably connected to a network fabric; a second file server operably connected to the network fabric; a first disk array operably coupled to the first file server and to the second file server; a second disk array operably coupled to the first file server and to the second file server; first file system information loaded on the first file server, the first file system information including a first intent log of proposed changes to the first metadata; second file system information loaded on the second file server, the second file system information including a second intent log of proposed changes to the second metadata, the first file server having a copy of the second intent log, the second file server maintaining a copy of the first intent log, thereby allowing the first file server to access files on the second disk array in the event of a failure of the second file server.

[0040] In one embodiment, a distributed file storage system provides hot-swapping of file servers by: loading first file system metadata on a first file server operably connected to a network fabric, the first file system operably connected to a first disk drive and a second disk drive; loading second file system metadata on a second file server connected to the network fabric, the second file system operably connected to the first disk drive and to the second disk drive; copying a first intent log from the first file server to a backup intent log on the second file server, the first intent log providing information regarding future changes to information stored on the first disk drive; and using the backup intent log to allow the second file server to make changes to the information stored on the first disk drive.
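The role of the backup intent log can be sketched as follows. The class `LoggingServer`, its methods, the `(key, value)` form of a log entry, and the dictionary that stands in for the shared first disk drive are hypothetical simplifications; the structure of an actual Intent Log Entry is the subject of FIG. 12 and is not reproduced here.

```python
class LoggingServer:
    """Stand-in for a file server that keeps an intent log (illustrative)."""

    def __init__(self, name):
        self.name = name
        self.intent_log = []         # this server's own intended changes
        self.backup_intent_log = []  # copy of the peer server's intent log

    def log_intent(self, peer, key, value):
        """Record an intended change and copy the entry to the peer."""
        entry = (key, value)
        self.intent_log.append(entry)
        peer.backup_intent_log.append(entry)

    def take_over(self, disk):
        """Apply the peer's logged changes to a disk both servers can reach."""
        for key, value in self.backup_intent_log:
            disk[key] = value
        self.backup_intent_log.clear()


first_disk = {}  # stands in for the first disk drive, reachable by both servers
server_a = LoggingServer("A")
server_b = LoggingServer("B")

# Server A logs two intended changes, then is assumed to fail before applying them.
server_a.log_intent(server_b, "block-7", b"new data")
server_a.log_intent(server_b, "block-9", b"more data")

# Server B uses its backup intent log to make the changes on the first disk.
server_b.take_over(first_disk)
```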

[0041] In one embodiment, a distributed file storage system includes: a first file server operably connected to a network fabric; a file system including first file system information loaded on the first file server, the file system configured to create second file system information on a second file server that comes online sometime after the first file server has begun servicing file requests, the file system configured to allow a requester to locate files stored by the first file server and files stored by the second file server without prior knowledge as to which file server stores the files.

[0042] In one embodiment, a distributed file storage system adds servers during ongoing file system operations by: loading first file system metadata on a first file server operably connected to a network fabric; creating at least one new file on a second file server that comes online while the first file server is servicing file requests, the at least one new file created in response to a request issued to the first file server, the distributed file system configured to allow a requester to locate files stored by the first file server and files stored by the second file server without prior knowledge as to which file server stores the files.

[0043] In one embodiment, a distributed file storage system includes: first metadata managed primarily by a first file server operably connected to a network fabric, the first metadata includes first file location information, the first file location information includes at least one server id; and second metadata managed primarily by a second file server operably connected to the network fabric, the second metadata includes second file location information, the second file location information includes at least one server identifier, the first metadata and the second metadata configured to allow a requester to locate files stored by the first file server and files stored by the second file server in a directory structure that spans the first file server and the second file server.

[0044] In one embodiment, a distributed file storage system stores data by: creating first file system metadata on a first file server operably connected to a network fabric, the first file system metadata describing at least files and directories stored by the first file server; creating second file system metadata on a second file server connected to the network fabric, the second file system metadata describing at least files and directories stored by the second file server, the first file system metadata and the second file system metadata include directory information that spans the first file server and the second file server, the directory information configured to allow a requestor to find a location of a first file catalogued in the directory information without prior knowledge as to a server location of the first file.

[0045] In one embodiment, a distributed file storage system balances the loading of servers and the capacity of drives associated with the servers, the file system includes: a first disk drive including a first unused capacity; a second disk drive including a second unused capacity, wherein the second unused capacity is smaller than the first unused capacity; a first server configured to fill requests from clients through access to at least the first disk drive; and a second server configured to fill requests from clients through access to at least the second disk drive, and configured to select an infrequently accessed file from the second disk drive and push the infrequently accessed file to the first disk drive, thereby improving a balance of unused capacity between the first and second disk drives without substantially affecting a loading for each of the first and second servers.
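The capacity-balancing step described above can be sketched as follows. The function `push_coldest_file`, the `(size, access_count)` representation of a file, and the sample values are hypothetical; the specification does not state how an infrequently accessed file is identified, so the lowest access count is used here as an assumption.

```python
def push_coldest_file(source_files, source_unused, target_unused):
    """Pick the least-accessed file on the fuller drive and push it.

    source_files maps a file name to (size, access_count) on the drive with
    less unused capacity. Returns (name or None, new_source_unused,
    new_target_unused).
    """
    # Nothing to do if the source drive is not the fuller of the two.
    if not source_files or source_unused >= target_unused:
        return None, source_unused, target_unused
    # ASSUMPTION: "infrequently accessed" means the lowest access count.
    name = min(source_files, key=lambda n: source_files[n][1])
    size = source_files[name][0]
    if size > target_unused:
        return None, source_unused, target_unused
    return name, source_unused + size, target_unused - size


# Hypothetical second drive holding one cold file and one hot file.
files_on_second_drive = {"a.log": (10, 2), "b.db": (30, 500)}
moved, second_unused, first_unused = push_coldest_file(
    files_on_second_drive, source_unused=20, target_unused=100
)
```

Moving the rarely read file evens out unused capacity while leaving the frequently read file, and therefore most of the request load, where it was.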

[0046] In one embodiment, a distributed file storage system includes: a first file server operably connected to a network fabric; a second file server operably connected to the network fabric; first file system information loaded on the first file server; and second file system information loaded on the second file server, the first file system information and the second file system information configured to allow a client computer operably connected to the network fabric to locate files stored by the first file server and files stored by the second file server without prior knowledge as to which file server stores the files.

[0047] In one embodiment, a data engine offloads data transfer operations from a server CPU. In one embodiment, the server CPU queues data operations to the data engine.

[0048] In one embodiment, a distributed file storage system includes: a plurality of disk drives for storing parity groups, each parity group includes storage blocks, the storage blocks include one or more data blocks and a parity block associated with the one or more data blocks, each of the storage blocks stored on a separate disk drive such that no two storage blocks from a given parity set reside on the same disk drive, wherein file system metadata includes information to describe the number of data blocks in one or more parity groups.

[0049] In one embodiment, a distributed file storage system stores data by: determining a size of a parity group in response to a write request, the size describing a number of data blocks in the parity group; arranging at least a portion of data from the write request according to the data blocks; computing a parity block for the parity group; storing each of the data blocks on a separate disk drive such that no two data blocks from the parity group reside on the same disk drive; and storing the parity block on a separate disk drive that does not contain any of the data blocks.
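The write path described above can be sketched as follows. The function `build_parity_group`, the 4-byte block size, the zero padding of a short final block, the rule for choosing the group size, and the use of exclusive OR for the parity block are assumptions made for illustration; the specification states only that the size is determined in response to the write request.

```python
BLOCK_SIZE = 4  # ASSUMPTION: tiny block size to keep the example readable


def build_parity_group(data, disk_ids, block_size=BLOCK_SIZE):
    """Size a parity group for a write and place each block on its own disk.

    Returns (number_of_data_blocks, {disk_id: block}); the last placed block
    is the parity block. Data beyond one group would go to further groups.
    """
    if len(disk_ids) < 2:
        raise ValueError("need at least one data disk and one parity disk")
    # Determine the group size: enough blocks for the data, but never more
    # than the number of disks minus the one reserved for parity.
    needed = max(1, -(-len(data) // block_size))  # ceiling division
    n = min(needed, len(disk_ids) - 1)
    # Arrange the data into n blocks, zero-padding a short final block.
    chunk = data[: n * block_size]
    blocks = [
        chunk[i * block_size:(i + 1) * block_size].ljust(block_size, b"\0")
        for i in range(n)
    ]
    # Compute the parity block (exclusive OR assumed here).
    parity = bytes(block_size)
    for block in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, block))
    # One block per disk, so no two blocks of the group share a drive.
    placement = dict(zip(disk_ids, blocks + [parity]))
    return n, placement


count, placement = build_parity_group(b"abcdefgh", ["d0", "d1", "d2", "d3"])
```

In this example an 8-byte write yields a group of two data blocks plus parity on three of the four drives, leaving the fourth drive free for other groups.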

[0050] In one embodiment, a distributed file storage system includes: a plurality of disk drives for storing parity groups, each parity group includes storage blocks, the storage blocks include one or more data blocks and a parity block associated with the one or more data blocks, each of the storage blocks stored on a separate disk drive such that no two storage blocks from a given parity set reside on the same disk drive; and a redistribution module to dynamically redistribute parity groups by combining some parity groups to improve storage efficiency.

[0051] In one embodiment, a distributed file storage system stores data by: determining a size of a parity group in response to a write request, the size describing a number of data blocks in the parity group; arranging at least a portion of data from the write request according to the data blocks; computing a parity block for the parity group; storing each of the data blocks on a separate disk drive such that no two data blocks from the parity group reside on the same disk drive; storing the parity block on a separate disk drive that does not contain any of the data blocks; and redistributing the parity groups to improve storage efficiency.

[0052] In one embodiment, a distributed file storage system includes: a plurality of disk drives for storing parity groups, each parity group includes storage blocks, the storage blocks include one or more data blocks and a parity block associated with the one or more data blocks, each of the storage blocks stored on a separate disk drive such that no two storage blocks from a given parity set reside on the same disk drive; and a recovery module to dynamically recover data lost when at least a portion of one disk drive in the plurality of disk drives becomes unavailable, the recovery module configured to produce a reconstructed block by using information in the remaining storage blocks of a parity set corresponding to an unavailable storage block, the recovery module further configured to split the parity group corresponding to an unavailable storage block into two parity groups if the parity group corresponding to an unavailable storage block spanned all of the drives in the plurality of disk drives.

[0053] In one embodiment, a distributed file storage system stores data by: determining a size of a parity group in response to a write request, the size describing a number of data blocks in the parity group; arranging at least a portion of data from the write request according to the data blocks; computing a parity block for the parity group; storing each of the data blocks on a separate disk drive such that no two data blocks from the parity group reside on the same disk drive; storing the parity block on a separate disk drive that does not contain any of the data blocks; reconstructing lost data by using information in the remaining storage blocks of a parity set corresponding to an unavailable storage block to produce a reconstructed parity group; and splitting the reconstructed parity group corresponding to an unavailable storage block into two parity groups if the reconstructed parity group is too large to be stored on the plurality of disk drives.
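The reconstruct-then-split behavior described above can be sketched as follows. The functions `xor_all` and `recover_and_split`, the use of `None` to mark the unavailable block, the even split point, and the exclusive-OR parity are assumptions for illustration only; the specification does not prescribe how the group is divided.

```python
def xor_all(blocks):
    """Byte-wise exclusive OR of a non-empty list of equal-length blocks."""
    out = bytes(len(blocks[0]))
    for block in blocks:
        out = bytes(x ^ y for x, y in zip(out, block))
    return out


def recover_and_split(blocks, parity, available_disks):
    """Rebuild the one lost data block, then split the group if it no longer fits.

    blocks is the list of data blocks with exactly one None marking the
    unavailable block. Returns a list of (data_blocks, parity_block) groups.
    """
    lost = blocks.index(None)
    survivors = [b for b in blocks if b is not None]
    rebuilt = list(blocks)
    rebuilt[lost] = xor_all(survivors + [parity])
    # The group needs one disk per data block plus one for parity.
    if len(rebuilt) + 1 <= available_disks:
        return [(rebuilt, parity)]
    mid = len(rebuilt) // 2
    if mid == 0:
        raise ValueError("a single-block group cannot be split further")
    # ASSUMPTION: split the data blocks roughly in half, each half with its
    # own freshly computed parity block.
    halves = [rebuilt[:mid], rebuilt[mid:]]
    return [(half, xor_all(half)) for half in halves]


# A group of three data blocks plus parity spanned all four drives; one drive
# failed, so only three drives remain.
groups = recover_and_split([b"\x01", None, b"\x04"], b"\x07", available_disks=3)
```

Each resulting group needs fewer drives than the original, so both can be placed on the drives that remain while still keeping every block of a group on a separate drive.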

[0054] In one embodiment, a distributed file storage system integrates parity group information into file system metadata.

BRIEF DESCRIPTION OF THE DRAWINGS

[0055] These and other aspects, advantages, and novel features of theinvention will become apparent upon reading the following detaileddescription and upon reference to the accompanying drawings:

[0056] FIG. 1 is a general overview of a distributed file storage system showing clients, a communication fabric, and a plurality of servers with associated disk arrays.

[0057] FIG. 2 is a block diagram of a server node.

[0058] FIG. 3 is a block diagram of five metadata structures and connections between the five metadata structures.

[0059] FIG. 4 shows an example portion of a Filename Table.

[0060] FIG. 5 shows an example of a Gee-string stored in a Gee Table.

[0061] FIG. 6 shows one embodiment of the structure of a G-node.

[0062] FIG. 7 shows one embodiment of the structure of a Gnid-string.

[0063] FIG. 8A shows one embodiment of the structure of a Cache Node.

[0064] FIG. 8B shows a conceptual division of a Cache Node Table into three lists.

[0065] FIG. 9 shows a sample portion of a lock string.

[0066] FIG. 10 shows one embodiment of Refresh Nodes configured as a binary tree.

[0067] FIG. 11 shows one embodiment of Refresh Nodes configured as a doubly-linked list.

[0068] FIG. 12 shows one embodiment of the structure of an Intent Log Entry.

[0069] FIG. 13 shows one embodiment of the structure of a file handle.

[0070] FIG. 14A is a block diagram depicting one embodiment of a file handle look-up process.

[0071] FIG. 14B is a block diagram depicting one embodiment of a file access process.

[0072] FIG. 15 is a flow chart depicting one embodiment of performing a file access.

[0073] FIG. 16 is a flow chart depicting one embodiment of performing a file handle look-up.

[0074] FIG. 17 is a flow chart depicting one embodiment of caching file data.

[0075] FIG. 18 is a flow chart depicting one embodiment of file allocation.

[0076] FIG. 19 shows one embodiment of Super G-nodes.

[0077] FIG. 20A shows one embodiment of a Super G-node.

[0078] FIG. 20B shows one embodiment of a scheme to use Super G-nodes to hold metadata for files of widely varying sizes.

[0079] FIG. 21 illustrates a conventional disk array that incrementally stripes data in a RAID mapping architecture.

[0080] FIG. 22A illustrates one embodiment of a distributed file storage system.

[0081] FIG. 22B illustrates another embodiment of a distributed file storage system having built-in data redundancy.

[0082] FIG. 23 illustrates a distributed file storage mechanism.

[0083] FIG. 24A illustrates a data and parity information storage method.

[0084] FIG. 24B illustrates another data and parity information storage method.

[0085] FIG. 25 illustrates another embodiment of a distributed file storage system having a variable capacity disk array.

[0086] FIG. 26A illustrates an embodiment of variable block number parity groups.

[0087] FIG. 26B illustrates an embodiment of variable size parity groups.

[0088] FIG. 27 illustrates one embodiment of a G-table used to determine parity group mapping.

[0089] FIG. 28 illustrates a method for storing data in the distributed file storage system.

[0090] FIG. 29 illustrates another embodiment of a G-table mapping structure.

[0091] FIG. 30 illustrates one embodiment of a fault-tolerant restoration process.

[0092] FIG. 31 illustrates a method for recovering corrupted or lost data in the distributed file storage system.

[0093] FIG. 32A illustrates one embodiment of a variably sized parity group used to store files.

[0094] FIG. 32B illustrates another embodiment of a variably sized parity group used to store files.

[0095] FIG. 33 illustrates a data storage process used by the distributed file storage system.

[0096] FIGS. 34A-C illustrate a parity set redistribution process.

[0097] FIG. 35A illustrates one embodiment of a parity group dissolution process.

[0098] FIG. 35B illustrates one embodiment of a parity group consolidation process.

[0099] FIG. 36 illustrates a parity group monitoring process.

[0100] FIG. 37 illustrates a parity group optimization/de-fragmentation process.

[0101] FIG. 38 illustrates a load balancing method used by the distributed file storage system.

[0102] FIG. 39 depicts a block diagram of an exemplary embodiment of servers and disk arrays of a distributed file storage system, which highlights the proactive object positioning of aspects of an exemplary embodiment of the invention.

[0103] FIG. 40 depicts a block diagram of an exemplary server of FIG. 39, according to aspects of an exemplary embodiment of the invention.

[0104] FIG. 41 depicts an object positioning plan for Server F3 of FIG. 39, according to aspects of an exemplary embodiment of the invention.

[0105] FIG. 42 is a block diagram of a server that provides efficient processing of data transfers between one or more client computers and one or more disk drives.

[0106] FIG. 43 is a block diagram of a data engine.

[0107] FIG. 44 is a map of data fields in a 64-bit data transfer instruction to the data engine for use with a 64-bit PCI bus.

DETAILED DESCRIPTION

[0108] Introduction

[0109] As data storage requirements increase, it is desirable to be able to easily increase the data storage capacity and/or performance of a data storage system. That is, it is desirable to be able to increase the available capacity and performance of a storage system without modifying the configuration of the clients accessing the system. For example, in a typical Personal Computer (PC) network environment, if a database accesses a network drive “M”, it is desirable to be able to add storage to this drive, all the while still calling the drive “M”, as opposed to adding, say, drives “N”, “O”, and “P” as storage requirements increase. In some cases, having to switch from a single drive “M” to four drives, “M”, “N”, “O”, “P” is a mere nuisance. However, in some cases such a change requires significant reconfiguration of the clients. In other cases, such a change requires modification of existing application software, and in some instances such a change simply will not work with the application being used.

[0110] The objective for more capacity can be met in some storage systems by adding additional disk drives to the system. However, this may not result in increasing performance. In fact, adding additional drives may cause a significant decrease in performance. This is because: (1) if more ports are not added to the system when new drives are added, the performance decreases because now more data is available (and presumably being accessed) through the same performance ports; and (2) the controller managing the file system metadata has more operations to perform and can become a bottleneck. Adding drives to existing systems may also be limited by physical form factors. That is to say, some systems have physical limits to how many drives can be added.

[0111] In one embodiment, the system described herein provides a Distributed File Storage System (DFSS) that can scale disk capacity, scale data throughput (e.g., megabytes per second of data delivery), and scale transaction processing throughput (e.g., processing of file system metadata). In one embodiment, the system also provides load balancing such that the scaled components handle the workload with improved efficiency. In one embodiment, the DFSS is dynamically distributed. In one embodiment, the DFSS allows the integration of multiple servers so that the aggregation of servers appears to a client as a single storage device. With the DFSS, multiple servers can access and control the same disk array, separate disk arrays, or both simultaneously. The DFSS is designed so that each server can continue to read and write data to the drives it controls even when other controllers in the DFSS fail. The DFSS also provides a mechanism for balancing the load on the controllers and the drives.

[0112] In one embodiment, the DFSS is designed such that when multiple controllers are controlling a single array of disk drives (also called a drive array), some or all of the servers connected to the drive array have valid copies of the file system metadata describing the data on that drive array. This means that each server has direct access to all of the file system metadata for one or more of the drive arrays it can access. Thus: (1) a server can continue to operate normally if the other servers in the system fail; and (2) there is little or no performance degradation due to one server polling another server regarding location of data on drive arrays. The DFSS provides inter-server communication to maintain synchronization of the file system metadata. The DFSS is designed such that a server can read from more than one drive array and can read from drive arrays maintained by another server. In one embodiment, only one controller attached to a particular drive array has write privileges for that particular drive array at a given time.

[0113] The DFSS maintains a description of which servers have read and write privileges to a file represented by a file handle passed to the client. When the client looks up a file handle, the client is informed of its options regarding which servers it may read the data from (which is typically several) and which one server it needs to use to write data. In addition, since the servers typically have multiple network interface cards (ports) to the client network, the file handle also includes data which suggests to the client which port is likely to be the least utilized.

[0114] The DFSS is also designed such that when there are multiple servers that are not sharing the same drive arrays, the drive arrays are seamlessly integrated. For example, suppose a system has four servers (numbered S1, S2, S3, and S4) and two drive arrays (numbered A1 and A2). Further suppose that S1 and S2 control A1 and that S3 and S4 control A2. The DFSS allows for a directory on A1 to have children on A2. In fact, the file system keeps track of usage statistics, and if A2 is less utilized than A1, the file system will automatically create the next files on A2 instead of A1. The DFSS provides coordination between the servers to allow this level of integration.

[0115] Because each server has a complete set of metadata for each drive array it can access, a particular server can continue to operate even if other servers fail. The DFSS includes a mechanism for determining if a controller has failed and a mechanism for transferring write privileges in such cases. Clearly if all controllers attached to a given drive array fail, the data on that drive array will become inaccessible. However, the capability to support multiple controllers for each drive array greatly reduces the likelihood of such an event. If all such controllers for a drive array fail, read and write operations on the remaining controller/drive arrays continue unhindered.

[0116] The DFSS can perform load balancing at three levels. First, when a directory lookup is performed, the file system encodes within the file handle the lesser-used network interface to provide balancing of network interface resources. Second, when a new file is created, it is created on lesser-used drives and owned by a lesser-used server. Third, dynamic analysis of loading conditions is performed to identify under-utilized and over-utilized drives. In response, the file system in some cases redistributes the parity groups across the drives in the existing drive array for more efficient usage of parity checking, and in other cases the file system moves files to lesser-used drive arrays.

[0117] Many data storage systems are designed with the twin goals of providing fast access to data and providing protection against loss of data due to the failure of physical storage media. Prior art solutions typically relied on Redundant Arrays of Independent Disks (RAID). By having the data striped across multiple drives, the data can be accessed faster because the slow process of retrieving data from disk is done in parallel, with multiple drives accessing their data at the same time. By allocating an additional disk for storing parity information, if any one disk fails, the data in the stripe can be regenerated from the remaining drives in the stripe.

[0118] While this approach has proven effective in many applications, it does have a few fundamental limitations, one of which is that there is a rigid algorithm for mapping addresses from the file system to addresses on the drives in the array. Hence stripes are created and maintained in a rigid manner, according to a predetermined equation. An unfortunate side effect results from this limitation. There is no mechanism for keeping data from a particular file from becoming highly fragmented, meaning that although the data could fit in a single stripe, the data could actually be located in many stripes (this situation can be particularly acute when multiple clients are writing to a file system).

[0119] In one embodiment, the DFSS abandons the notion of having a rigid algorithm to map from addresses in the file system to drive addresses. Instead, DFSS uses Distributed Parity Groups (DPGs) to perform the mapping. Data blocks in the DPGs are mapped via a mapping table (or a list of tables) rather than a fixed algorithm, and the blocks are linked together via a table of linked lists. As discussed below, the DPG mapping can be maintained separately or can be integrated into the file system metadata.

[0120] Initially the mapping is somewhat arbitrary and is based on the expectation that the drives will be accessed evenly. However, the system keeps track of drive usage frequency. As patterns of usage are established, blocks are copied from frequently accessed drives to infrequently accessed drives. Once the copy is complete, the blocks are remapped to point to the new copies.
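As a minimal sketch of the table-based mapping and the copy-then-remap sequence described above, and not a definitive implementation, consider the following. The class BlockMap, its methods, and the allocate and copy callables are hypothetical names introduced for illustration. The caller is assumed to pass only candidate drives that do not already hold another block of the same parity group.

```python
from collections import Counter


class BlockMap:
    """Hypothetical mapping table: logical block -> (drive, physical block)."""

    def __init__(self):
        self.location = {}
        self.accesses = Counter()  # drive -> number of reads observed

    def place(self, logical, drive, physical):
        self.location[logical] = (drive, physical)

    def read(self, logical):
        drive, physical = self.location[logical]
        self.accesses[drive] += 1  # track drive usage frequency
        return drive, physical

    def rebalance_one(self, logical, candidate_drives, allocate, copy):
        """Copy one block to the least-accessed candidate drive, then remap it."""
        src = self.location[logical]
        target = min(candidate_drives, key=lambda d: self.accesses[d])
        if target == src[0]:
            return False               # block already on the least-used drive
        dst = (target, allocate(target))
        copy(src, dst)                 # the copy completes first ...
        self.location[logical] = dst   # ... and only then is the block remapped
        return True


block_map = BlockMap()
block_map.place("b0", "d0", 5)
for _ in range(3):
    block_map.read("b0")
copies = []
moved = block_map.rebalance_one(
    "b0", ["d0", "d1"],
    allocate=lambda drive: 0,                     # stand-in free-block allocator
    copy=lambda src, dst: copies.append((src, dst)),
)
```

Because the table entry is updated only after the copy returns, a reader consulting the table during the copy is still directed to the original block.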

[0121] The disk drives are viewed as consisting of a collection of blocks. The block size is typically an integer multiple of the drive sector size. The drive sector size is a characteristic of the drives, and is the minimum size of data that can be written to the drives. For most Fibre Channel drives, the sector size is 512 bytes.

[0122] In one embodiment, the blocks are grouped via a G-table. The G-table is a collection of Gees, which represent the individual blocks and their linkage. Each Gee contains a code that identifies the Gee's purpose (e.g., linkage or representing data). Gees for a DPG are strung together into a G-group. The entire G-table is cached, either in whole or in part, in Random Access Memory (RAM). Individual Gees are modified in cache to indicate when a specific block of data is in cache. This provides a straightforward way to be assured that if any client has caused disk data to be cached, any other client seeking that same data will be directed to the already cached data.

[0123] RAID systems are implemented independently from the file system. That is, from the file system's point of view, the array looks like one big disk. Hence stripes are created and maintained without any knowledge of the data they contain. Two unfortunate side effects result from this limitation. First, there is no mechanism for keeping data from a particular file from becoming highly fragmented, meaning that although the data could fit in a single stripe, the data could actually be located in many stripes (this situation can be particularly acute when multiple clients are writing to files). This can result in each drive doing hundreds of seeks, while a smarter system could do just one. This is significant because the seek is the slowest operation related to accessing data on disks.

[0124] Second, when a drive fails, the data on that drive must be regenerated on a replacement drive exactly as it was on the failed drive. This means that, for example, a server that has only 10% of its disk space currently used can only regenerate the data onto a replacement drive (or a hot spare), even though there is more than enough disk space to regenerate the data onto the other disks. For remote installations, if a hot spare is used, once one failure occurs, the hot spare is used and the system can no longer tolerate another failure until the bad drive is replaced. Of course this could be lessened by the usage of multiple hot spares, but that significantly increases the amount of disk storage that is not being used and merely “waiting in the wings”.

[0125] In one embodiment, the DFSS management of the DPGs is integrated into the file system, thus making the file system “aware” of the DPGs and how data blocks from a file are collected into parity groups. Making the file system aware of the DPGs allows the file servers in the DFSS to more intelligently use the disk arrays than a RAID system would. With the DPG system, the file system has knowledge of the drive arrays and therefore reduces the kind of fragmenting that is typical of RAID systems.

[0126] Furthermore, in the event of a failure of one drive in the DFSS, the data from the failed drive can be redistributed across the remaining drives in a disk array. For example, suppose a file contained a DPG having a length (also known as a “span”) of 9 (data spread across 9 drives, where 8 drives contain the data blocks and the ninth drive contains the parity block). When one drive fails, the data can be regenerated and redistributed using a DPG of span 8. Note that without knowledge of which blocks are associated with which files, this redistribution is not possible, because the file must still have the same number of total blocks, but when the span is reduced from 9 to 8, there is an orphan block of 1 which must be still associated with the file. This orphan is associated with another DPG in the same file. This association is not possible without knowledge of the file. Alternatively, if there are at least ten disks in the disk array, the data can be regenerated and redistributed using a DPG span of 9, omitting the failed drive. Thus, the integration of DPG management into the file system provides flexibility not available in a conventional RAID system.
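The span arithmetic in the example above can be illustrated with a short sketch. It assumes one parity block per DPG, so a group on N surviving drives holds at most N - 1 data blocks. The function name shrink_groups and the policy of attaching an orphan to the first group of the file with room are assumptions made for illustration only.

```python
def shrink_groups(data_counts, surviving_drives):
    """data_counts lists the number of data blocks in each DPG of one file.
    Returns new per-group data-block counts that fit the surviving drives."""
    max_data = surviving_drives - 1  # one drive per group holds the parity block
    if max_data < 1:
        raise ValueError("a parity group needs at least two drives")
    groups, orphans = [], 0
    for count in data_counts:
        orphans += max(0, count - max_data)  # blocks that no longer fit
        groups.append(min(count, max_data))
    # Attach orphans to other DPGs of the same file that still have room.
    for i in range(len(groups)):
        take = min(orphans, max_data - groups[i])
        groups[i] += take
        orphans -= take
    # Orphans that fit nowhere form new, smaller DPGs.
    while orphans > 0:
        take = min(orphans, max_data)
        groups.append(take)
        orphans -= take
    return groups


# A span-9 DPG (8 data blocks) and a span-4 DPG (3 data blocks), 8 drives left.
regrouped = shrink_groups([8, 3], surviving_drives=8)
```

The total number of data blocks in the file is unchanged by the regrouping, which is the property the paragraph above relies on.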

[0127] Since the DFSS has full knowledge of the file system, the DFSS has knowledge of which blocks on the disks are not used. This allows the DFSS to identify heavily used disks and redistribute data from heavily-used disks to unused blocks on lesser-used disks.

[0128] Storage system capability is typically measured in capacity, bandwidth, and the number of operations per second that can be processed. It is desirable to be able to easily scale a storage system, that is, to be able to easily increase the storage capacity, the bandwidth, or the operations per second capacity of the storage system. Storage system capacity is scaled by adding disk drives or by replacing disk drives with drives having greater capacity. To increase storage system bandwidth or transactions per second capacity, it is typically necessary to add servers. It is desirable to be able to add and utilize these resources with little or no user intervention or configuration.

[0129] In one embodiment, the DFSS can automatically identify and utilize available resources, including disk drives and servers. Two features are used to realize this: 1) detecting the addition of disk drives and/or servers; and 2) automatically initializing and incorporating newly added disk drives and/or servers. The same mechanisms that are used to detect newly-added resources can also be used to support the deletion of resources.

[0130] With regard to detection of new resources, modern, high performance networking technologies such as Fibre Channel and Gigabit Ethernet supply methods for determining what devices are connected to the network. By storing the device map, and periodically querying the network for an updated device map, the presence of new devices can be determined. New devices are added to the appropriate server resource map.
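The detection step above amounts to comparing a stored device map with a freshly queried one. The following is a sketch under stated assumptions: the class DeviceMonitor and the query_network callable are hypothetical stand-ins for whatever discovery method the network technology supplies, and devices are identified by simple string names.

```python
class DeviceMonitor:
    """Keeps the last known device map and reports changes on each poll."""

    def __init__(self, query_network):
        # query_network is an assumed callable returning the ids of all
        # devices currently visible on the network.
        self.query_network = query_network
        self.device_map = set(query_network())

    def poll(self):
        """Return (added, removed) device ids since the previous poll."""
        current = set(self.query_network())
        added = current - self.device_map
        removed = self.device_map - current
        self.device_map = current  # store the updated device map
        return added, removed


visible = ["disk0", "disk1"]
monitor = DeviceMonitor(lambda: visible)
visible.append("disk2")            # a drive is hot-added
first_poll = monitor.poll()
visible.remove("disk0")            # a drive is removed
second_poll = monitor.poll()
```

The same comparison yields removed devices, which is consistent with the statement that the detection mechanism can also support the deletion of resources.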

[0131] In one embodiment, a resource manager in the DFSS provides the capability to incorporate the new resources automatically. The resource manager keeps track of available disk resources, as measured in available disk devices and the available free blocks on each disk. The resource manager keeps track of the available servers and the unutilized capacity, in terms of bandwidth and transactions per second, of each server. When new resources are added to the DFSS, the resource manager incorporates the additions into a resource database.

[0132] The resource manager works in conjunction with aspects of the DFSS to dynamically allocate storage and controller resources to files. When the DFSS needs to create a new file, or extend an already created file, it coordinates with the resource manager to create a DPG of the appropriate size. A similar approach is followed by the DFSS in the selection of which server to use in the creation of a new file.

[0133] The resource manager approach also supports a load balancing capability. Load balancing is useful in a distributed file system to spread the workload relatively uniformly across all of the available resources (e.g., across disks, network interfaces, and servers). The ability to proactively relocate file data is a tool that can be used to support load balancing by moving file data from over-utilized resources to under-utilized resources. In one embodiment, the resource manager supports load balancing by incorporating resource usage predictions.

[0134] In the DFSS, the server workload includes communication with client machines, reading and writing files from disks, managing file metadata, and managing server resources such as storage capacity. The workload is divided up among the server hardware resources. If the workload is evenly divided, the resulting performance will be improved. Thus, one key to performance is intelligent resource management. In one embodiment, resource management involves adaptive load balancing of server workloads. Prior art distributed file system technologies do not offer an effective method of performing load balancing in the face of a dynamic load environment and thus cannot provide optimum performance.

[0135] In one embodiment, adaptive load balancing is based on the implementation of two mechanisms. First, a mechanism is provided to predict the future server workload. Second, a mechanism is provided to reallocate distributed server resources in response to the predicted workload.

[0136] Prediction of the future workload has several aspects. The first of these aspects is the past history of server workload, in terms of file access statistics, server utilization statistics, and network utilization statistics. The loading prediction mechanism uses these statistics (with an appropriate filter applied) to generate predictions for future loading. As a very simple example, a file that has experienced heavy sequential read activity in the past few minutes will likely continue to experience heavy sequential read access for the next few minutes.
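One simple candidate for the filter mentioned above is an exponentially weighted moving average, which weights recent samples more heavily than older ones. This choice of filter, the function name predict_load, and the smoothing factor alpha are assumptions made for illustration; the text above does not specify which filter is applied.

```python
def predict_load(samples, alpha=0.5):
    """Predict the next load value from past samples (oldest first) using an
    exponentially weighted moving average. alpha near 1 favors recent samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    estimate = float(samples[0])
    for sample in samples[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate


# Reads per minute for one file: idle, then two minutes of heavy access.
predicted = predict_load([0, 100, 100], alpha=0.5)
```

With these samples the estimate rises toward the recent heavy load without jumping to it immediately, matching the intuition that recently busy files are likely to stay busy for a while.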

[0137] The predictions for future workload can be used to proactively manage resources to improve performance and capacity usage. One mechanism used to reallocate server workload is the movement and replication of content (files) such that server and storage utilization is balanced and the direction of client accesses to available servers is balanced. Some degree of cooperation from client machines can be used to provide more effective load balancing, but client cooperation is not strictly required.

[0138] A file server contains a number of hardware resources, including controllers, storage elements (disks), and network elements. In the configuration used by the DFSS, multiple client machines are connected through a (possibly redundant) client network to one or more server clusters. Each server cluster has one or more servers and a disk storage pool.

[0139] Software resident on each server collects statistics regarding file accesses and server resource utilization. This includes information regarding the access frequency, access bandwidth and access locality for the individual files, the loading of each disk controller and disk storage element in terms of CPU utilization, data transfer bandwidth, transactions per second, and the loading of each network element in terms of network latency and data transfer bandwidth.

[0140] The collected statistics are subjected to various filter operations, which result in a prediction of future file and resource utilization (i.e., workload). This prediction can also be modified by server configuration data which has been provided in advance by a system administrator, and explicit “hints” regarding future file and/or resource usage which can be provided directly from a client machine.

[0141] The predicted workload is then used to develop a plan that determines where to move content (files) between storage elements and where to direct client accesses to controllers in such a manner that the overall workload is distributed as evenly as possible, resulting in best overall load balance and distributed server performance. The predicted workload can be used to perform the following specific types of load balancing:

[0142] 1) Client Network Load Balancing, which includes managing client requests to the extent possible such that the client load presented to the servers in a cluster, and the load presented to the network ports within each cluster, are evenly balanced.

[0143] 2) Intra-Cluster Storage Load Balancing, which includes the movement of data between the disks connected to a controller cluster such that the disk bandwidth loading among each of the drives in an array, and the network bandwidth among networks connecting disk arrays to servers, are balanced. There are two goals. The first goal is to achieve relatively uniform bandwidth loading for each storage sub-network. The second goal is to achieve relatively uniform bandwidth loading for each individual disk drive. This is accomplished by moving relatively infrequently accessed material to drives with frequently accessed material.

[0144] 3) Inter-Node Storage Load Balancing, which includes the movement of data between drives connected to different clusters to equalize disk access load between clusters. This is done at a higher cost than Intra-Node Drive Load Balancing, as file data must actually be copied between controllers over the client network.

[0145] 4) Intra-Node Storage Capacity Balancing, which includes movement of data between the disks connected to a server (or servers in a cluster) to balance disk storage utilization among each of the drives.

[0146] 5) Inter-Node Storage Capacity Balancing, which includes movement of data between drives connected to different servers to equalize overall disk storage utilization among the different servers. This is done at a higher cost than Intra-Node Drive Capacity Balancing, as file data must actually be copied between controllers over the network.

[0147] 6) File Replication Load Balancing, which includes load balancing through file replication. This is an extension of Inter-Node Drive Load Balancing. High usage files are replicated so that multiple controller clusters have one or more than one local (read-only) copy. This allows the workload associated with these heavily-accessed files to be distributed across a larger set of disks and servers.

[0148] Disks and servers in the DFSS can be “hot swapped” and “hot added” (meaning they can be replaced or added while the DFSS is online and servicing file requests). Disks in a disk array need not match in capacity or throughput. Extra capacity is automatically detected, configured, and used. Data is redistributed in the background (both across servers and across DPGs) to improve system performance. Hot adding of servers allows for increased file operations per second and file system capacity. Hot-added servers are automatically configured and used.

[0149] In one embodiment, servers are arranged in clusters that operate as redundant groups (typically as redundant pairs). In normal operation, the servers in a cluster operate in parallel. Each acts as a primary server for a portion of the file system. Each server in a cluster maintains a secondary copy of the other's primary file system metadata and intent log. The intent log tracks differences between metadata stored in memory (e.g., metadata in a metadata cache) and metadata stored on disk. Upon failure of a server in the cluster, the remaining server (or servers) will pick up the workload of the failed server with no loss of metadata or transactions.

[0150] Each server in a high-performance data storage system includes storage controller hardware and storage controller software to manage an array of disk drives. Typically, a large number of disk drives are used in a high performance storage system, and the storage system in turn is accessed by a large number of client machines. This places a large workload on the server hardware and server software. It is therefore important that the servers operate in an efficient manner so that they do not become a bottleneck in the storage system. In one embodiment, a high-performance data path is provided in the server so that data can efficiently be moved between the client machines and disks with a minimum amount of software intervention.

[0151] Prior art approaches for server and storage controllers tend to be software intensive. Specifically, a programmable CPU in the server becomes involved in the movement of data between the client and the disks in the disk array. This limits the performance of the storage system because the server CPU becomes a bottleneck. While current approaches may have a certain degree of hardware acceleration, such as XOR parity operations associated with RAID, these minimal acceleration techniques do not adequately offload the server CPU.

[0152] In one embodiment, the DFSS uses a server architecture that largely separates the data path from the control message path. Control messages (e.g., file read/write commands from clients) are routed to a host CPU in the server. The host CPU processes the commands, and sets up the network and storage interfaces as required to complete the data transfer operations associated with the commands. The data transfer operations, once scheduled with the network and storage interfaces, can be completed without further CPU involvement, thus significantly offloading the host CPU. In one embodiment, a data flow architecture packages instructions with data as it flows between the network interfaces and data cache memories.

[0153] The server hardware and software perform the functions of interfacing with clients via the network interfaces, servicing client file operation requests, setting up disk read and write operations needed to service these requests, and updating the file metadata as necessary to manage the files stored on disk.

[0154] The controller hardware provides a control flow path from the network and storage interfaces to the host CPU. The host CPU is responsible for controlling these interfaces and dealing with the high level protocols necessary for client communications. The host CPU also has a non-volatile metadata cache for storing file system metadata.

[0155] A separate path for data flow is provided that connects the network and storage interfaces with a non-volatile data cache. In one embodiment, the separate path for data flow is provided by a data engine. The data path is used for bulk data transfer between the network and storage interfaces. As an example of the data path operation, consider a client file read operation. A client read request is received on one of the network interfaces and is routed to the host CPU. The host CPU validates the request, and determines from the request which data is desired. The request will typically specify a file to be read, and the particular section of data within the file. The host CPU will use file metadata to determine if the data is already present in the data cache memory, or if it must be retrieved from the disks. If the data is in the data cache, the CPU will queue a transfer with the network interface to transfer the data directly from the data cache to the requesting client, with no further CPU intervention required. If the data is not in the data cache, the CPU will queue one or more transfers with the storage interfaces to move the data from disk to the data cache, again without any further CPU intervention. When the data is in the data cache, the CPU will queue a transfer on the network interface to move the data to the requesting client, again with no further CPU intervention.
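The control flow of the read example above can be sketched as follows. The sketch models only the queueing decisions of the host CPU and not the data movement itself. The class HostCPU, its queues, and the on_disk_transfer_done completion hook are hypothetical names, and a cache key is assumed to identify a file and the section of data within it.

```python
from collections import deque


class HostCPU:
    """Models the host CPU's queueing decisions for client read requests."""

    def __init__(self):
        self.cached = set()           # keys of data present in the data cache
        self.storage_queue = deque()  # disk-to-cache transfers queued
        self.network_queue = deque()  # cache-to-client transfers queued
        self.waiting = {}             # key -> clients waiting on a disk read

    def handle_read(self, client, key):
        if key in self.cached:
            # Cache hit: queue the transfer straight from cache to the client.
            self.network_queue.append((key, client))
        else:
            # Cache miss: queue one disk-to-cache transfer per key.
            if key not in self.waiting:
                self.storage_queue.append(key)
            self.waiting.setdefault(key, []).append(client)

    def on_disk_transfer_done(self, key):
        """Assumed completion hook: the data is now in the data cache."""
        self.cached.add(key)
        for client in self.waiting.pop(key, []):
            self.network_queue.append((key, client))


cpu = HostCPU()
section = ("file-a", 0)                 # (file, section) cache key
cpu.handle_read("client-1", section)    # miss: only a storage transfer is queued
queued_before_completion = len(cpu.network_queue)
cpu.on_disk_transfer_done(section)
cpu.handle_read("client-2", section)    # hit: served from the data cache
```

In both paths the CPU only appends queue entries; the transfers themselves are left to the interfaces and the data engine.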

[0156] One aspect of this autonomous operation is that the CPU schedules data movement operations by merely writing an entry onto a network or storage interface queue. The data engine and the network and storage interfaces are connected by buses that include address and data buses. In one embodiment, the network or storage interface does the actual data movement (or sequence of data movements) independently of the CPU by encoding an instruction code in the address bus that connects the data engine to the interface. The instruction code is set up by the host CPU when the transfer is queued, and can specify that data is to be written or read to one or both of the cache memories. In addition, it can specify that an operation such as a parity XOR operation or a data conversion operation be performed on the data while it is in transit. Because instructions are queued with the data transfers, the host CPU can queue hundreds or thousands of instructions in advance with each interface, and all of these can be completed asynchronously and autonomously. The data flow architecture described above can also be used as a bridge between different networking protocols.
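One way to picture an instruction code carried on the address bus is sketched below in Python. The bit layout (a 32-bit address with the top 4 bits holding the code), the opcode names, and the opcode values are all assumptions made for illustration; the description does not specify an encoding.

```python
# Hypothetical layout: the top 4 bits of a 32-bit bus address carry the
# instruction code, and the low 28 bits carry the cache address.
ADDR_BITS = 28
ADDR_MASK = (1 << ADDR_BITS) - 1

# Illustrative instruction codes; names and values are assumptions.
OPCODES = {"READ": 0x1, "WRITE": 0x2, "XOR_WRITE": 0x3}


def encode(opcode_name, cache_address):
    """Pack an instruction code and a cache address into one bus address."""
    if not 0 <= cache_address <= ADDR_MASK:
        raise ValueError("cache address out of range")
    return (OPCODES[opcode_name] << ADDR_BITS) | cache_address


def decode(bus_address):
    """Recover (opcode_name, cache_address) from a bus address."""
    code = bus_address >> ADDR_BITS
    name = next(n for n, c in OPCODES.items() if c == code)
    return name, bus_address & ADDR_MASK
```

Because the instruction travels with the address of each transfer, the host CPU can prepare many such words in advance and is not needed when the transfer actually runs.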

[0157] As described above, the data engine relieves the host CPU of direct involvement in the movement of data from the client to the disks and vice-versa. The data engine can be a general purpose processor, digital signal processor, programmable FPGA, other forms of soft or hard programmable logic, or a fully custom ASIC.

[0158] The data engine provides the capability for autonomous movement of data between client network interfaces and data cache memory, and between disk network interfaces and cache memory. The server CPU involvement is merely in initializing the desired transfer operations. The data engine supports this autonomy by combining an asynchronous data flow architecture, a high-performance data path that can operate independently of the server CPU data paths, and a data cache memory subsystem. The data engine also implements the parity generation functions required to support a RAID-style data protection scheme.

[0159] The data engine is data-flow driven. That is, the instructions for the parallel processing elements are embedded in data packets that are fed to the data engine and to the various functional blocks within the data engine.

[0160] In one embodiment, the data engine has four principal interfaces: two data cache RAM interfaces, and two external bus interfaces. Other versions of the data engine can have a different number of interfaces depending on performance goals.

[0161] A data path exists between each network interface and each cache interface. In each of these data paths is a processing engine that controls data movement between the interfaces as well as operations that can be performed on the data as it moves between the interfaces. These processing engines are data-flow driven as described above.

[0162] The processing engine components that are used to perform these functions include an external bus write buffer, a feedback buffer, a cache read buffer, a cache write buffer, a parity engine, and the associated controller logic that controls these elements. The buffer elements are memories of appropriate sizes that smooth the data flow between the external interfaces, the parity engines, and the caches.

[0163] The data engine is used to provide a data path between client network interface controllers and storage network interface controllers. The network interface controllers may support Fibre Channel, Ethernet, InfiniBand, or other high performance networking protocols. One or more host CPUs schedule network transfers by queuing the data transfer operations on the network interface controllers. The network interface controllers then communicate directly with the data engine to perform the data transfer operations, completely autonomously from any additional CPU involvement. The data transfer operations may require only the movement of data, or they may combine the movement of data with other operations that must be performed on the data in transit.

[0164] The processing engines in the data engine can perform five principal operations, as well as a variety of support operations. The principal operations are: read from cache; write to cache; XOR write to cache; write to one cache with XOR write to other cache; write to both caches.
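The five principal operations can be modeled in software as follows. This Python sketch is illustrative only: the class DataEngineModel and its method names are hypothetical, and it assumes that an "XOR write" combines the incoming data with the bytes already held at the target cache address, which the description does not state explicitly.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


class DataEngineModel:
    """Hypothetical software model of two data caches and five operations."""

    def __init__(self, size):
        self.caches = [bytearray(size), bytearray(size)]

    def read(self, cache, addr, n):
        # 1. Read from cache.
        return bytes(self.caches[cache][addr:addr + n])

    def write(self, cache, addr, data):
        # 2. Write to cache.
        self.caches[cache][addr:addr + len(data)] = data

    def xor_write(self, cache, addr, data):
        # 3. XOR write to cache (assumed: XOR with existing contents).
        old = self.read(cache, addr, len(data))
        self.write(cache, addr, xor_bytes(old, data))

    def write_with_xor_other(self, cache, addr, data):
        # 4. Write to one cache with XOR write to the other cache.
        self.write(cache, addr, data)
        self.xor_write(1 - cache, addr, data)

    def write_both(self, addr, data):
        # 5. Write to both caches.
        self.write(0, addr, data)
        self.write(1, addr, data)
```

Under this assumption, streaming each data block through operation 4 leaves the data in one cache while accumulating a running parity in the other.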

[0165] The data-flow control structure of the data engine reduces the loading placed on the server CPU. Once data operations are queued, the server CPU does not need to be directly involved in the movement of data, in the operations that are performed on data, or the management of a data transfer.

[0166] FIG. 1 shows a general overview of a Distributed File Storage System (DFSS) 100 that operates on a computer network architecture. One or more clients 110 operating on one or more different platforms are connected to a plurality of servers 130, 131, 132, 133, 134, 135 by way of a communication fabric 120. In one embodiment, the communication fabric 120 is a Local Area Network (LAN). In one embodiment, the communication fabric 120 is a Wide Area Network (WAN) using a communication protocol such as, for example, Ethernet, Fibre Channel, Asynchronous Transfer Mode (ATM), or other appropriate protocol. The communication fabric 120 provides a way for a client 110 to connect to one or more servers 130-135.

[0167] The number of servers included in the DFSS 100 is variable. However, for the purposes of this description, their structure, configuration, and functions are similar enough that the description of one server 130 is to be understood to apply to all 130-135. In the descriptions of other elements of the figure that are similarly duplicated in the DFSS 100, a description of one instance of an element is similarly to be understood to apply to all instances.

[0168] The server 130 is connected to a disk array 140 that stores a portion of the files of the distributed file storage system. Together, the server-disk array pair 130, 140 can be considered to be one server node 150. The disks in the disk array 140 can be Integrated Drive Electronics (IDE) disks, Fibre Channel disks, Small Computer Systems Interface (SCSI) disks, InfiniBand disks, etc. The present disclosure refers to disks in the disk array 140 by way of example and not by way of limitation. Thus, for example, the “disks” can be many types of information storage devices, including, for example, disk drives, tape drives, backup devices, memories, other computers, computer networks, etc.

[0169] In one embodiment, one or more server nodes 150, 151 are grouped into a cluster 160 of server nodes. In one embodiment, each server 130 in the cluster 160 is connected not only to its own disk array 140, but also to the disk array(s) 141 of the other server(s) 131 of the cluster 160. Among other advantages conferred by this redundant connection is the provision of alternate server paths for reading a popular file or a file on a busy server node. Additionally, allowing servers 130, 131 to access all disk arrays 140, 141 of a cluster 160 provides the assurance that if one server 130 of a cluster 160 should fail, access to the files on its associated disk array 140 is not lost, but can be provided seamlessly by the other servers 131 of the cluster 160.

[0170] In one embodiment, files that are stored on the disk array 140 of one server node 150 are mirrored on the disk array(s) 141 of each server node 151 in the cluster 160. In such an embodiment, if the disk array 140 should become unusable, the associated server 130 will still be able to access copies of its files on the other disk array(s) 141 of the cluster 160.

[0171] As shown in FIG. 1, the server 130 is associated with the disk array 140 that can include multiple disk drives of various sizes and capacities. Thus, the DFSS 100 allows for much more flexibility than many conventional multi-disk file storage systems that require strict conformity amongst the disk arrays of the system. Among other advantages conferred by this flexibility is the ability to upgrade portions of the system hardware without having to upgrade all portions uniformly and simultaneously.

[0172] In many conventional networked storage systems, a user on a client needs to know and to specify the server that holds a desired file. In the DFSS 100 described in FIG. 1, although the files of the file system can be distributed across a plurality of server nodes, this distribution does not require a user on a client system 110 to know a priori which server has a given file. That is, to a user, it appears as if all files of the system 100 exist on a single server. One advantage of this type of system is that new clusters 160 and/or server nodes 150 can be added to the DFSS 100 while still maintaining the appearance of a single file system.

[0173] FIG. 2 is a block diagram showing one embodiment 200 of the server node 150 in the DFSS 100. As in FIG. 1, the server node 150 includes the server 130 and the disk array 140 or other data storage device.

[0174] The server 130 includes a server software module 205. The server software module 205 includes server interface (SI) software 240 for handling communications to and from clients 110, file system (FS) software 250 for managing access, storage, and manipulation of the files, and a JBOD (Just a Bunch of Disks) interface (JI) 260 for handling communications with the disk array 140 and with other disk arrays of the cluster 160. Communications between the server interface 240 and the file system 250 take place using a Client Server Object 245. Communications between the file system 250 and the JBOD interface 260 take place using a Disk Service Object 255. In one embodiment, as depicted in FIG. 2, the software of the file system 250 resides principally on the servers 130, 131, while the file data is stored on standard persistent storage on the disk arrays 140, 141 of the DFSS 100.

[0175] The server software module 205 also includes a polling module 270 for polling clients 110 of the DFSS 100 and a polling module 280 for polling disk arrays 140 of the DFSS 100.

[0176] In the embodiment 200 shown in FIG. 2, the server 130 includes a Fibre Channel Application Programming Interface (FC-API) 210 with two Fibre Channel ports 211 for communicating via the fabric 120 with the client 110 and with other server(s) 151 of the cluster 160. The FC-API 210 also communicates with the server interface 240 and with the client polling module 270 in the server software module 205.

[0177] The server 130 includes an FC-API 220 with two Fibre Channel ports 221 for communicating with the disk array 140 and with other disk arrays of its cluster 160. The FC-API 220 may communicate with the disk array 140 via a communication fabric 222, as shown in FIG. 2. The FC-API 220 may also communicate with the disk array 140 directly. The FC-API 220 also communicates with the JBOD interface 260 and with the disk polling module 280 in the server software module 205.

[0178] The server 130 includes an Ethernet interface 230 with two Ethernet ports 231, 232 configured to handle Gigabit Ethernet or 10/100T Ethernet. The Ethernet interface 230 communicates with the server interface 240 in the server software module 205. In FIG. 2, the Gigabit Ethernet port 231 communicates with one or more Ethernet clients 285 of the DFSS 100. The Ethernet clients 285 include an installable client interface software component 286 that communicates with the client's operating system and with the Ethernet interface 230 of the server node 150. In FIG. 2, the Ethernet port 232 communicates with an administrative interface system 290.

[0179] To improve performance for certain implementations, a small file system software layer may also exist on clients 110, as in the embodiment 200 shown in FIG. 2, where the client system 110 includes an installable software component called the Client Interface (CI) 201 that communicates with both the client's operating system and, via the communication fabric 120, with a server node 150 of the DFSS 100.

[0180] The functions of the FC-API modules 210, 220 and the Ethernet interface 230 may alternatively be handled by other communication protocols.

[0181] Overview of Metadata Structures

[0182] In order to perform normal file system operations, such as, for example, creating and deleting files, allowing clients to read and write files, caching file data, and keeping track of file permissions, while also providing the flexibility mentioned above, a cluster 160 maintains metadata about the files stored on its disk arrays 140, 141. The metadata comprises information about file attributes, file directory structures, physical storage locations of the file data, administrative information regarding the files, as well as other types of information. In various embodiments, the file metadata can be stored in a variety of data structures that are configured in a variety of interconnected configurations, without departing from the spirit of the distributed file system. FIG. 3 is a block diagram that shows one embodiment of a configuration comprising five metadata structures and connections between them. Each of these structures, the data they hold, and how the structures are used are described in greater detail below.

[0183] Referring to FIG. 3, a Filename Table 310 includes a collection of filenames for both files stored on the server node 150 as well as files that are children of directories stored on the server node 150.

[0184] A G-node Table 330 includes a collection of G-nodes, where each G-node contains data related to attributes of a file. A one-to-one correspondence exists between the G-nodes and files stored on the server node 150.

[0185] A Gee Table 320 holds data about the physical locations of the file blocks on the disk array 140. The Gee Table 320 additionally includes pointers to each associated G-node in the G-node Table 330, and each G-node in the G-node Table 330 includes a pointer to an associated portion of the Gee Table 320.

[0186] A Gnid Table 340 on the server node 150 includes Gnid-strings that hold data describing the directory structure of that portion of the file system 250 whose directories are stored on the server node 150. A one-to-one correspondence exists between the Gnid-strings and directory files stored on the server node 150. Gnid-strings are collections of Gnids, which hold information about individual files that exist within a given directory. The file system 250 allows files within a directory to be stored on a cluster that is different from the cluster on which the parent directory is stored. Therefore, Gnids within a Gnid-string on the server node 150 can represent files that are stored on clusters other than the current cluster 160.

[0187] Each Gnid includes several pointers. A Gnid in the Gnid Table 340 includes a pointer to an associated filename for the file represented by the Gnid. Because the Filename Table 310 includes filenames for both files stored on the server node 150 as well as files that are children of directories stored on the server node 150, all Gnids on the server node 150 point to the Filename Table 310 on the server node 150.

[0188] A Gnid in the Gnid Table 340 includes a pointer to its parent directory's G-node in the G-node Table 330, and a parent directory's G-node includes a pointer to the beginning of its associated Gnid-string in the Gnid Table 340.

[0189] Each Gnid also includes a pointer to its own G-node. Since a Gnid can represent a file that is stored on another cluster 160 of the file system 250, a pointer to the Gnid's own G-node can point to the G-node Table 330 on another server node of the file system 250.

[0190] A Cache Node Table 350 includes the Cache Nodes that hold information about the physical locations of file blocks that have been cached, including a pointer to a cache location as well as a pointer to a non-volatile location of the data on the disk array 140. A pointer to a Cache Node exists in the Gee Table 320 for every associated data block that has been cached. Similarly, a pointer exists in the Cache Node to a location in the Gee Table 320 associated with a disk storage location for an associated data block.

[0191] Mirroring of Metadata Structures

[0192] To review the description from FIG. 1, in one embodiment, the servers 130, 131 of a cluster 160 are able to access files stored on all the disk array(s) 140, 141 of the cluster 160. In one embodiment, all server nodes 150, 151 of a cluster 160 have copies of the same Filename Table 310, Gee Table 320, G-node Table 330, and Gnid Table 340.

[0193] In embodiments where files, as well as metadata, are mirrored across the server nodes 150, 151 of a cluster 160, a different Gee Table 320 exists for each disk array 140, 141 within a cluster 160, since the Gee Table 320 holds information about the physical storage locations of the files on a given disk array, and since the disk arrays 140, 141 within a given cluster 160 are not constrained to being identical in capacity or configuration. In such an embodiment, the servers 130, 131 within the cluster 160 have copies of both the Gee Table 320 for a first disk array 140 and the Gee Table 320 for each additional disk array 141 of the cluster.

[0194] In one embodiment, in order to enhance both the security of the metadata and efficient access to the metadata, each server node 150, 151 stores a copy of the Filename Table 310, the G-node Table 330, the Gnid Table 340, and the Gee Table 320 in both non-volatile memory (for security) and in volatile memory (for fast access). Changes made to the volatile versions of the metadata structures 310, 320, 330, 340 are periodically sent to the non-volatile versions for update.

[0195] In one embodiment, the server nodes 150, 151 in the cluster 160 do not have access to one another's cache memory. Therefore, unlike the four metadata structures 310, 320, 330, and 340 already described, the Cache Node Table 350 is not replicated across the server nodes 150, 151 of the cluster 160. Instead, the Cache Node Table 350 stored in volatile memory on a first server 130 refers to the file blocks cached on the first server 130, and the Cache Node Table 350 stored in volatile memory on a second server 131 refers to file blocks cached on the second server 131.

[0196] Division of Metadata Ownership

[0197] In one embodiment, the metadata structures described in FIG. 3 are duplicated across the server nodes 150, 151 of the cluster 160, giving all servers in the cluster 160 access to a set of shared files and associated metadata. All of the server nodes 150, 151 in the cluster 160 can access the files stored within the cluster 160, and all are considered to be “owners” of the files. Various schemes can be employed in order to prevent two or more servers 130, 131 from altering the same file simultaneously. For example, in embodiments where the cluster 160 includes two server nodes 150 and 151, one such scheme is to conceptually divide each of the duplicated metadata structures in half and to assign write privileges (or “primary ownership”) for one half of each structure to each server node 150, 151 of the cluster 160. Only the server node 150 that is primary owner of the metadata for a particular file has write privileges for the file. The other server node(s) 151 of the cluster 160 are known as “secondary owners” of the file, and they are allowed to access the file for read operations.

[0198] In a failure situation, when the server 130 determines that its counterpart 131 is not functional, the server 130 can assume primary ownership of all portions of the metadata structures 310, 320, 330, 340 and all associated files owned by the server 131, thus allowing operation of the file system 250 to continue without interruption. In one embodiment, if a server in cluster 160 having more than two servers experiences a failure, then primary ownership of the failed server's files and metadata can be divided amongst the remaining servers of the cluster.
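The transfer of primary ownership after a server failure can be sketched as follows. The function name reassign_primary_ownership, the dictionary representation of ownership, and the round-robin division among survivors are assumptions for illustration; the description says only that ownership "can be divided amongst the remaining servers."

```python
def reassign_primary_ownership(owners, failed, survivors):
    """Reassign files whose primary owner has failed.

    owners:    dict mapping a file identifier to its primary server
    failed:    identifier of the failed server
    survivors: list of servers that remain functional
    The round-robin split is an illustrative policy, not the source's.
    """
    orphaned = sorted(f for f, server in owners.items() if server == failed)
    for i, f in enumerate(orphaned):
        owners[f] = survivors[i % len(survivors)]
    return owners
```

With a two-server cluster the survivors list has one entry, so the surviving server simply becomes primary owner of everything, matching the first case in the paragraph above.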

[0199] Filename Table

[0200] FIG. 4 shows a sample portion of the Filename Table 310. In one embodiment, the Filename Table 310 on the server 130 contains Filename Entries 410, 420, 430, 440 for files which are either stored in the disk array 140 or are parented by a directory file in the disk array 140. In one embodiment, the Filename Table 310 is stored as an array. In FIG. 4, a ‘Start of String’ (SOS) marker 411 marks the beginning of the Filename Entry 410, and a character string 414 holds characters of the filename, “Doe.” In one embodiment, a checksum 412 for the string 414 is also included in the Filename Entry 410. In one embodiment, a filename length count 413 representing the length of the string 414, shown in FIG. 4 to have a value of “3,” is included in the Filename Entry 410. The checksum 412 and the filename length count 413 advantageously allow for an expedited search of the Filename Table 310.

[0201] A ‘Start of String’ (SOS) marker 421 marks the beginning of the Filename Entry 420 with a checksum 422, a filename length count 423 of “6,” and a character string 424 holding the filename “Thomas.”

[0202] A ‘Deleted String’ (DS) marker 431 marks the beginning of the Filename Entry 430 with a checksum 432, a filename length count 433 of “4,” and a character string 434 holding the filename “Frog.”

[0203] A ‘Start of String’ (SOS) marker 441 marks the beginning of the Filename Entry 440 with a checksum 442, a filename length count 443 of “2,” and a character string 444 holding the filename “It.”

[0204] Comparing the checksums 412, 422, 432, 442 and the filename length counts 413, 423, 433, 443 of each Filename Entry 410, 420, 430, 440 to those calculated for a desired filename provides a quick way to eliminate most Filename Entries in the Filename Table 310 before having to make a character-by-character comparison of the character strings 414, 424, 434, 444.

[0205] Another advantage of including the filename length counts 413, 423, 433, 443 applies when deleting a Filename Entry 410, 420, 430, 440 from the Filename Table 310. Replacing the ‘Start of String’ (SOS) marker 411, 421, 441 with a ‘Deleted String’ (DS) marker 431, as in the Filename Entry 430, signals that the corresponding file is no longer stored on the disk array 140, even if the remainder of the Filename Entry 432-434 remains unchanged. The filename length 433 accurately represents the length of the “deleted” string 434, and when a new filename of the same length (or shorter) is to be added to the table 310, the new name and checksum (and filename length count, if necessary) can be added into the slot left by the previous filename.
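The search, delete, and slot-reuse behavior of the Filename Table can be sketched in Python as below. The names FilenameEntry, name_checksum, find, delete, and add are hypothetical, the table is modeled as a list of records rather than a packed array, and the checksum (a sum of bytes modulo 256) is a placeholder because the description does not specify a checksum algorithm.

```python
from dataclasses import dataclass

SOS, DS = "SOS", "DS"  # 'Start of String' and 'Deleted String' markers


def name_checksum(name: str) -> int:
    # Placeholder checksum; the actual algorithm is not specified.
    return sum(name.encode()) & 0xFF


@dataclass
class FilenameEntry:
    marker: str
    checksum: int
    length: int
    name: str


def make_entry(name):
    return FilenameEntry(SOS, name_checksum(name), len(name), name)


def find(table, name):
    """Return the index of a live entry for name, or None."""
    want_sum, want_len = name_checksum(name), len(name)
    for i, e in enumerate(table):
        if e.marker != SOS:
            continue
        # Cheap rejection before any character-by-character comparison.
        if e.checksum != want_sum or e.length != want_len:
            continue
        if e.name == name:
            return i
    return None


def delete(table, name):
    """Mark the entry deleted; the rest of the entry is left unchanged."""
    i = find(table, name)
    if i is not None:
        table[i].marker = DS
    return i


def add(table, name):
    """Reuse a deleted slot that is long enough, else append a new entry."""
    for e in table:
        if e.marker == DS and len(name) <= e.length:
            e.marker, e.checksum = SOS, name_checksum(name)
            e.length, e.name = len(name), name
            return
    table.append(make_entry(name))
```

For example, after deleting "Frog" from a table holding "Doe," "Thomas," "Frog," and "It," adding a four-character name reuses the third slot instead of growing the table.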

[0206] Gee Table

[0207] The file system 250 divides files into one or more file logical blocks for storage. Each file logical block is stored in a cluster of one or more disk logical blocks on the disk array 140. Although the file system 250 retains many of the advantages of a conventional file system implemented on RAID (Redundant Array of Independent Disks), including the distribution of files across multiple disk drives and the use of parity blocks to enhance error checking and error correcting, unlike many RAID systems, the file system 250 does not restrict file logical blocks to one uniform size. File logical blocks of data and parity logical blocks can be the size of any integer multiple of a disk logical block. This variability of file logical block size allows for flexibility in allocating disk space and, thus, for optimized use of system resources.

[0208] In the file system 250, the size of a file logical block is described by its integer multiple, called its extent, in disk logical blocks. For example, a file logical block with an extent of 3 is stored in a cluster of 3 disk logical blocks on the disk array 140.

[0209] The Gee Table 320 stores metadata describing the disk logical block locations on the disk array 140 for each file logical block of the files.

[0210] FIG. 5 shows one embodiment of a Gee Table 320 that is implemented as a flat array. Each indexed row 510-529 of the Gee Table 320 is called a Gee. In FIG. 5, Gees 510-528 relate to a single file that is divided into ten file logical blocks. Such a set of Gees 510-528, which together describe the logical location of a single file on the disk array 140, is known as a Gee-string 500. A Gee-string is made up of one or more Gee-groups. Each Gee-group is a set of contiguous Gees that all relate to a single file. In FIG. 5, the Gee-string 500 includes three Gee-groups, 550, 551, and 552. The Gee 529 relates to a separate file, as will be explained in more detail below.

[0211] In one embodiment, the Gees 510-529 include a G-code field 590 and a Data field 591. The G-code field 590 in the Gees 510-529 indicates the type of data that is included in the Data field 591. In FIG. 5, four types of G-codes 590 are depicted: “G-NODE,” “DATA,” “PARITY,” and “LINK.”

[0212] In one embodiment, the G-code 590 of “G-NODE” indicates that the Gee is a first Gee of a Gee-group. For example, the first Gee of the Gee-group 550 is a G-NODE Gee 510. Similarly, the first Gees of the Gee-groups 551 and 552 are also G-NODE Gees 520, 525.

[0213] The Data field 591 of a G-NODE Gee can include a pointer to the file's G-node in the G-node Table 330 and information about whether this is the first (or Root) G-NODE Gee of the file's Gee-string 500. The Data field 591 of a G-NODE Gee can also include information about the extent, or size, of the logical disk block clusters for the file logical blocks of the Gee-group, as will be described in greater detail below.

[0214] In FIG. 5, the Data fields 591 of the G-NODE Gees 510, 520, and 525 contain a reference to G-node index “67,” indicating that they all relate to the file associated with the G-node at index “67” of the G-node Table 330. That is, they all relate to portions of the same file. The Data field 591 of the Gee 529 refers to the G-node index “43,” indicating that it relates to a different file.

[0215] Of the G-NODE Gees 510, 520, 525, only the first Gee 510 contains an indication that it is a Root Gee, meaning that it is the first Gee of the Gee-string 500. The Gee 529 is a G-NODE Gee, indicating that it is a first Gee of a Gee-group (the remainder of which is not shown), and the Data field 591 of the Gee 529 also indicates that the Gee 529 is not a Root Gee for its Gee-string.

[0216] Following the G-NODE Gee in a Gee-group are Gees representing one or more Distributed Parity Groups (DPGs) 560, 561, 562, 563. A DPG is a set of one or more contiguous DATA Gees followed by an associated PARITY Gee. A DATA Gee is a Gee with a G-code 590 of “DATA” that lists disk logical block(s) where a file logical block is stored. For example, in FIG. 5, the Gees 511-513, 515-517, 521-522, and 526-527 are all DATA Gees, and each is associated with one file logical block 592.

[0217] A PARITY Gee is a Gee with a G-code 590 of “PARITY.” Each PARITY Gee lists disk logical block location(s) for a special type of file logical block that contains redundant parity data used for error checking and error correcting one or more associated file logical blocks. A PARITY Gee is associated with the contiguous DATA Gees that immediately precede the PARITY Gee. A set of contiguous DATA Gees and the PARITY Gee that follows them are known collectively as a Distributed Parity Group 560, 561, 562, 563.

[0218] For example, in FIG. 5, the PARITY Gee 514 is associated with the DATA Gees 511-513, and together they form the Distributed Parity Group 560. Similarly, the PARITY Gee 518 is associated with the DATA Gees 515-517, and together they form the Distributed Parity Group 561. The PARITY Gee 523 is associated with the DATA Gees 521-522, which together form the Distributed Parity Group 562, and the PARITY Gee 528 is associated with the DATA Gees 526-527, which together form the Distributed Parity Group 563.
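The relationship between the data blocks of a Distributed Parity Group and its parity block can be illustrated with XOR parity, the parity operation named earlier in connection with the data engine. In this Python sketch the function name xor_blocks and the sample byte values are hypothetical, and the blocks are assumed to be of equal size, consistent with a shared extent within a Gee-group.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks; used for parity and rebuild."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A parity group with three data blocks (sample contents are arbitrary).
data_blocks = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity_block = xor_blocks(data_blocks)

# If the disk holding the second data block is lost, XOR of the surviving
# data blocks and the parity block reproduces the missing block.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity_block])
```

Because each block of a group resides on a separate disk drive, a single drive failure removes at most one block per group, which is the case this reconstruction covers.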

[0219] The size of a disk logical block cluster described by a DATA Gee or a PARITY Gee, as measured in number of disk logical blocks, matches the extent listed in the previous G-NODE Gee. In the example of FIG. 5, the G-NODE Gee 510 defines an extent size of 2, and each DATA and PARITY Gee 511-518 of the two Distributed Parity Groups 560, 561 of the Gee-group 550 lists two disk logical block locations. Similarly, G-NODE Gee 520 of the second Gee-group 551 defines an extent size of 3, and each DATA and PARITY Gee 521-523 of the Gee-group 551 lists three disk logical block locations. G-NODE Gee 525 of the third Gee-group 552 defines an extent size of 3, and each DATA and PARITY Gee 526-528 of the Gee-group 552 lists three disk logical block locations.

[0220] If a Gee-group is not the last Gee-group in its Gee-string, then a mechanism exists to logically link the last Gee in the Gee-group to the next Gee-group of the Gee-string. LINK Gees 519, 524 have the G-code 590 of “LINK” and a listing in their respective Data fields 591 that provides the index of the next Gee-group of the Gee-string 500. For example, the Gee 519 is the last Gee of Gee-group 550, and its Data field 591 includes the starting index “76” of the next Gee-group 551 of the Gee-string 500. The Gee 524 is the last Gee of Gee-group 551, and its Data field 591 includes the starting index “88” of the next Gee-group 552 of the Gee-string 500. Since the Gee-group 552 does not include a LINK Gee, it is understood that Gee-group 552 is the last Gee-group of the Gee-string 500.
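Traversal of a Gee-string through its G-NODE, DATA, PARITY, and LINK Gees can be sketched as follows. The table is modeled as a Python dictionary from Gee index to a (G-code, Data) pair; the indices, disk block numbers, dictionary keys, and the function name walk_gee_string are hypothetical and do not reproduce FIG. 5.

```python
# Hypothetical miniature Gee Table: index -> (G-code, Data field).
example_table = {
    10: ("G-NODE", {"gnode": 67, "root": True, "extent": 2}),
    11: ("DATA", [100, 101]),
    12: ("DATA", [200, 201]),
    13: ("PARITY", [300, 301]),
    14: ("LINK", 76),  # index of the next Gee-group
    76: ("G-NODE", {"gnode": 67, "root": False, "extent": 3}),
    77: ("DATA", [400, 401, 402]),
    78: ("PARITY", [500, 501, 502]),
    79: ("G-NODE", {"gnode": 43, "root": False, "extent": 1}),  # other file
}


def walk_gee_string(table, root_index):
    """Yield (G-code, disk block list) for each DATA or PARITY Gee."""
    gcode, data = table[root_index]
    if gcode != "G-NODE" or not data["root"]:
        raise ValueError("root_index must point at a Root G-NODE Gee")
    extent = data["extent"]
    index = root_index + 1
    while index in table:
        gcode, data = table[index]
        if gcode == "LINK":
            # Jump to the G-NODE Gee that starts the next Gee-group.
            index = data
            extent = table[index][1]["extent"]
            index += 1
        elif gcode in ("DATA", "PARITY"):
            if len(data) != extent:
                raise ValueError("cluster size does not match extent")
            yield gcode, data
            index += 1
        else:
            # Any other G-code ends the string (no LINK Gee was present).
            return
```

The walk stops when a Gee-group ends without a LINK Gee, mirroring the rule that the absence of a LINK Gee identifies the last Gee-group of the Gee-string.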

[0221] A G-code 590 of “FREE” (not shown in FIG. 5) indicates that the Gee has never yet been allocated and has not been associated with any disk logical location(s) for storing a file logical block. A G-code 590 of “AVAIL” (not shown in FIG. 5) indicates that the Gee has been previously allocated to a cluster of disk logical block(s) for storing a file logical block, but that the Gee is now free to accept a new assignment. Two situations in which a Gee is assigned the G-code of “AVAIL” are: after the deletion of the associated file logical block; and after transfer of the file to another server in order to optimize load balance for the file system 250.

[0222] A G-code of “CACHE DATA” indicates that the disk logical block cluster associated with the Gee (which was previously a DATA Gee) has been cached. A G-code of “CACHE PARITY” indicates that the disk logical block cluster associated with this Gee (which was previously a PARITY Gee) has been cached. The CACHE DATA and CACHE PARITY G-codes will be described in greater detail when Cache Nodes and the Cache Node Table are described in connection with FIG. 8A below.

[0223] G-node Table

[0224] The G-node Table 330 is a collection of G-nodes, where each G-node includes attribute information relating to one file. Attribute information can include, but is not restricted to: information about physical properties of the file (such as, for example, its size and physical location on disk); information about the file's relationships to other files and systems (such as, for example, permissions associated with the file and server identification numbers for the primary and secondary owners of the file); and information about access patterns associated with the file (such as, for example, time of the last file access and time of the last file modification).

[0225] In addition to file attribute information, a G-node provides links to the root Gee and a midpoint Gee of the file's Gee-string in the Gee Table 320. If the file is a directory file, its G-node also contains a pointer to the beginning of the Gnid-string that describes the files contained in the directory, as will be explained with reference to FIG. 7 below.

[0226] In one embodiment, the G-node Table 330 is implemented as a flat array.

[0227] FIG. 6 shows one embodiment of information that can be included in a G-node 600. A File Attribute-type field 602 designates a file as belonging to a supported file type. For example, in one embodiment, NFNON indicates that the G-node is not currently associated with a file, NFREG indicates that the associated file is a regular file, NFDIR indicates that the associated file is a directory, and NFLINK indicates that an associated file is a symbolic link that points to another file.

[0228] A File Attribute-mode field 604 gives information regarding access permissions for the file.

[0229] A File Attribute-links field 606 designates the number ofdirectory entries for a file in the file system 250. This number can begreater than one if the file is the child of more than one directory, orif the file is known by different names within the same directory.

[0230] A File Attribute-uid field 608 designates a user ID for a file'suser/owner.

[0231] A File Attribute-gid field 610 designates a group ID of a file'suser/owner.

[0232] A File Attribute-size field 612 designates a size in bytes of agiven file.

[0233] A File Attribute-used field 614 designates an amount of diskspace used by a file.

[0234] A File Attribute-fileId field 620 designates a file ID.

[0235] A File Attribute-atime field 622 designates the time of the last access to the file.

[0236] A File Attribute-mtime field 624 designates the time of the last modification to the file.

[0237] A File Attribute-ctime field 626 designates the time of the last modification to a G-node (excluding updates to the atime field 622 and to the mtime field 624).

[0238] If a file is a directory file rather than a data file, then its Child Gnid Index field 628 is an index for the oldest child in an associated Gnid-string (to be described in greater detail with reference to FIG. 7 below); otherwise, this field is not used.

[0239] A Gee Index-Last Used field 630 and a Gee Offset-Last Used field 631 together designate a location of a most recently accessed Gee 510 for a given file. These attributes can be used to expedite sequential reading of blocks of a file.

[0240] A Gee Index-Midpoint field 632 and a Gee Offset-Midpoint field 633 together point to a middle Gee 510 of the Gee-string 500. Searching for a Gee for a given file block can be expedited using these two fields in the following way: if a desired block number is greater than the block number of the midpoint Gee, then sequential searching can begin at the midpoint of the Gee-string 500 rather than at its beginning.
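By way of illustration only, the midpoint shortcut can be sketched in Python as follows. The function name `find_gee`, the list `gee_blocks`, and the simplification that each Gee is represented by a single starting block number in ascending order are assumptions made for this sketch and are not part of the described embodiment.

```python
def find_gee(gee_blocks, midpoint_index, wanted_block):
    """Sequentially search a Gee-string for the Gee at wanted_block.

    gee_blocks[i] is a hypothetical starting block number for the i-th Gee;
    the list is assumed to be in ascending block order.
    Returns (index, gees_examined); index is None if no Gee matches.
    """
    # Begin at the midpoint only when the wanted block lies beyond it.
    start = midpoint_index if wanted_block > gee_blocks[midpoint_index] else 0
    examined = 0
    for i in range(start, len(gee_blocks)):
        examined += 1
        if gee_blocks[i] == wanted_block:
            return i, examined
    return None, examined


# Hypothetical Gee-string whose Gees start at blocks 0, 10, ..., 90.
gee_blocks = list(range(0, 100, 10))
mid = len(gee_blocks) // 2  # midpoint Gee at index 5 (block 50)
```

In this sketch, a request for a block past the midpoint skips the first half of the string, while a request for an earlier block falls back to a search from the root.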

[0241] A Gee Index-Tail field 634 and a Gee Offset-Tail field 635 together point to the last Gee 528 of the Gee-string 500. New data can easily be appended to the end of a file using the pointers 634 and 635.

[0242] A Gee Index-Root field 636 is an index of the root Gee 510 of a Gee-string for an associated file.

[0243] A G-node Status field 638 indicates whether the G-node is being used or is free for allocation.

[0244] A Quick Shot Status field 640 and a Quick Shot Link field 642 are used when a “snapshot” of the file system 250 is taken to allow for online updates and/or verification of the system that does not interrupt client access to the files. During a “snapshot,” copies of some portions of the system are made in order to keep a record of the system's state at one point in time, without interfering with the operation of the system. In some embodiments, more than one Quickshot can be maintained at a given time. The Quick Shot Status field 640 indicates whether the G-node was in use at the time of the “snapshot” and, therefore, if it has been included in the “snapshot.” If the G-node has been included in the “snapshot,” the Quick Shot Link field 642 provides a link to the newly allocated copy of the G-node.

[0245] In one embodiment, a bit-mask is associated with each element within the file system 250, identifying any of a number of Quickshot instances to which the element belongs. When a Quickshot is requested, a task can set the bit for every element, holding the file system at bay for a minimum amount of time. Thus, capturing the state of a file system comprises identifying elements in the file system as being protected, rather than actually copying any elements at the time of the Quickshot.

[0246] In one embodiment, the file system uses a copy-on-write mechanism so that data is not overwritten; new blocks are used for new data, and the metadata is updated to point to the new data. Thus, a minimum of overhead is required to maintain a Quickshot. If a block is being written and the file system element being modified has a bit set indicating that it is protected by a Quickshot, the metadata is copied to provide a Quickshot version of the metadata, which is distinct from the main operating system. Then, the write operation continues normally.
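The bit-mask and copy-on-write behavior described above can be sketched, purely as an illustration, in Python. The class `Element`, the functions `take_quickshot` and `write_metadata`, and the choice to clear the mask once a protected copy has been made are all assumptions of this sketch rather than details taken from the embodiment.

```python
class Element:
    """Hypothetical stand-in for one file system element (e.g. a G-node)."""

    def __init__(self, metadata):
        self.metadata = dict(metadata)
        self.quickshot_mask = 0     # bit i set => protected by Quickshot i
        self.quickshot_copies = {}  # Quickshot id -> preserved metadata


def take_quickshot(elements, quickshot_id):
    """Mark every element as protected; nothing is copied at this point."""
    for element in elements:
        element.quickshot_mask |= 1 << quickshot_id


def write_metadata(element, key, value):
    """Copy-on-write: preserve the old metadata once, then write normally."""
    if element.quickshot_mask:
        preserved = dict(element.metadata)
        quickshot_id = 0
        mask = element.quickshot_mask
        while mask:
            if mask & 1:
                element.quickshot_copies[quickshot_id] = preserved
            mask >>= 1
            quickshot_id += 1
        # Assumption: the live element is no longer the protected version.
        element.quickshot_mask = 0
    element.metadata[key] = value
```

Taking a Quickshot in this sketch touches only one integer per element, which reflects the stated goal of holding the file system at bay for a minimum amount of time; the copy is deferred until the first subsequent write.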

[0247] Gnid Table

[0248] Files in the file system 250 are distributed across a plurality of server nodes 150 while still appearing to clients 110 as a single file system. According to different embodiments, files can be distributed in a variety of ways. Files can be distributed randomly, or according to a fixed distribution algorithm, or in a manner that enhances load balancing across the system, or in other ways.

[0249] In one embodiment, the files of a given directory need not be stored physically within the same cluster as the cluster that stores the directory file itself. Nor does one large table or other data structure exist which contains all directory structure information for the entire file system 250. Instead, directory structure information is distributed throughout the file system 250, and each server node 150 is responsible for storing information about the directories that it stores and about the child files of those directories.

[0250] In one embodiment, server nodes of the DFSS 100 hold directory structure information for only the directory files that are stored on the server node and for the child files of those directories, that is, the files one level down from the parent directory. In another embodiment, server nodes of the DFSS 100 hold directory structure information for each directory file stored on the server node and for files from a specified number of additional levels below the parent directory in the file system's directory structure.

[0251] In one embodiment, an exception to the division of responsibility described above is made for the directory structure information for a “root” directory of the file system 250. The “root” directory is a directory that contains every directory as a sub-directory and, thus, every file in the file system 250. In this case, every server in the file system 250 can have a copy of the directory structure information for the “root” directory as well as for its own directories, so that a search for any file of unknown location can be initiated at the “root” directory level by any server of the file system 250. In another embodiment, the directory structure information for the “root” directory is stored only in the cluster that stores the “root” directory, and other clusters include only a pointer to the “root” directory.

[0252] The Gnid Table 340 on the server node 150 defines a structure for directory files that reside on the server node 150. The Gnid Table 340 comprises Gnid-strings, which, in one embodiment, are linked lists implemented within a flat array. In one embodiment, a Gnid-string exists for each directory file on the server node 150. Individual elements of a Gnid-string are called Gnids, and a Gnid represents a child file of a given parent directory.

[0253] FIG. 7 shows the structure of one embodiment of a Gnid-string 700. In this embodiment, the Gnid-string 700 for a directory file is a linked list of Gnids 710-713, where each Gnid represents one file in the directory. In one embodiment, in order to expedite searching the Gnid-string 700 for a given Gnid, the Gnids are kept in ascending order of the checksums 412, 422, 442 of the files' filenames 410, 420, 440, such that the Gnid with the smallest checksum is first in the Gnid-string 700. When a new file is added to a directory, a Gnid for the newly added file is inserted into the appropriate location in the Gnid-string 700. Search algorithms that increase the efficiency of a search can exploit this sorted arrangement of Gnids 710-713 within a Gnid-string 700.

[0254] Since Gnids share a common structure, a description of one Gnid 710 is to be understood to describe the structure of all other Gnids 711-713 as well.

[0255] The Gnid 710 includes, but is not restricted to, seven fields 720, 730, 740, 750, 760, 770, and 780. A Status field 720 indicates whether the Gnid 710 is a first Gnid (GNID_OLDEST) in the Gnid-string 700, a last Gnid (GNID_YOUNGEST) in the Gnid-string 700, a Gnid that is neither first nor last (GNID_SIBLING) in the Gnid-string 700, or a Gnid that is not currently in use (GNID_FREE).

[0256] A Parent G-node Ptr field 730 is a pointer to the G-node for the file's parent directory in the G-node Table 330.

[0257] A Sibling Gnid Ptr field 740 is a pointer to the next Gnid 711 on the Gnid-string 700. In the embodiment described above, the Sibling Gnid Ptr field 740 points to the Gnid within the Gnid-string 700 that has the next largest checksum 412, 422, 442 value. A NULL value for the Sibling Gnid Ptr field 740 indicates that the Gnid is the last Gnid of the Gnid-string 700.

[0258] A G-node Ptr field 750 is a pointer to the file's G-node 600, indicating both the server node that is primary owner of the file and the file's index into the G-node Table 330 on that server node.

[0259] A Filename Ptr field 760 is a pointer to the file's Filename Entry in the Filename Table 310.

[0260] A ForBiGnid Ptr field 770 is a pointer used for skipping ahead in the Gnid-string 700, and a BckBiGnid Ptr field 780 is a pointer for skipping backward in the Gnid-string 700. In one embodiment, the fields 770 and 780 can be used to link the Gnids into a binary tree structure, or one of its variants, also based on checksum size, thus allowing for fast searching of the Gnid-string 700.
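As a minimal sketch of the checksum-ordered Gnid-string, the following Python models only the Sibling Gnid Ptr linkage. The class `Gnid` and the functions `insert_gnid`, `find_gnid`, and `checksums_in_order` are hypothetical names, checksums are supplied directly rather than computed, and the skip pointers 770 and 780 are not modeled.

```python
class Gnid:
    """Hypothetical Gnid holding only the fields this sketch needs."""

    def __init__(self, checksum, filename):
        self.checksum = checksum
        self.filename = filename
        self.sibling = None  # plays the role of the Sibling Gnid Ptr field


def insert_gnid(head, new_gnid):
    """Insert new_gnid so checksums stay ascending; return the new head."""
    if head is None or new_gnid.checksum < head.checksum:
        new_gnid.sibling = head
        return new_gnid
    current = head
    while (current.sibling is not None
           and current.sibling.checksum <= new_gnid.checksum):
        current = current.sibling
    new_gnid.sibling = current.sibling
    current.sibling = new_gnid
    return head


def find_gnid(head, checksum, filename):
    """Walk the string, stopping early once checksums exceed the target."""
    current = head
    while current is not None and current.checksum <= checksum:
        if current.checksum == checksum and current.filename == filename:
            return current
        current = current.sibling
    return None


def checksums_in_order(head):
    """Collect checksums from first Gnid to last, for inspection."""
    out = []
    while head is not None:
        out.append(head.checksum)
        head = head.sibling
    return out
```

Because the string is sorted, a search in this sketch can abandon the walk as soon as it passes the target checksum instead of visiting every child of the directory.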

[0261] Cache Node Table

[0262] The Cache Node Table 350 stores metadata regarding which data blocks are currently cached as well as which data blocks have been most recently accessed. The Cache Node Table 350 is integrated with the file system 250 by way of a special type of Gee 510 in the Gee Table 320. When a data block is cached, a copy of its associated DATA Gee 511-513, 515-517, 521-522, 526-527, which describes the location of the data on the disk array 140, is sent to the Cache Node Table 350, where it is held until the associated data is released from the cache. Meanwhile, the DATA Gee 511-513, 515-517, 521-522, 526-527 in the Gee Table 320 is modified to become a CACHE DATA Gee; its G-Code 590 is changed from DATA to CACHE DATA, and instead of listing a data block's location on disk 140, the Data field 591 of the Gee now indicates a location in the Cache Node Table 350 where a copy of the original DATA Gee 511-513, 515-517, 521-522, 526-527 was sent and where information about the data block's current location in cache can be found.

[0263] In one embodiment, the Cache Node Table 350 is implemented as a list of fixed length Cache Nodes, where a Cache Node is associated with each Gee 511-513, 515-517, 521-522, 526-527 whose data has been cached. The structure of one embodiment of a Cache Node 800 is described in FIG. 8A.

[0264] Referring to FIG. 8A, the Cache Node 800 is shown to include nine fields. A Data Gee field 810 is a copy of the DATA Gee 511-513, 515-517, 521-522, 526-527 from the Gee Table 320 that allows disk location information to be copied back into the Gee Table 320 when the associated data block is released from cache. A PrevPtr field 815 holds a pointer to the previous Cache Node in the Cache Node Table 350. A NextPtr field 820 holds a pointer to the next Cache Node in the Cache Node Table 350. In one embodiment, the Cache Node Table 350 is implemented as a flat array, in which case the PrevPtr 815 and NextPtr 820 fields can hold indices of a previous and a next item in the table. A CacheBlockAddr field 825 holds a pointer to a location in cache where the associated data has been cached. A ReadCt field 830 is a counter of the number of clients currently reading the associated data block. A CacheTime field 835 holds a time that the associated cache contents were last updated. A Regenerated field 840 holds a flag indicating that the associated cache contents have been regenerated. A CacheBlockHiAddr field 845 and a CacheBlockLoAddr field 850 hold a “high water mark” and “low water mark” of the data in a cache block. These “water marks” can be used to demarcate a range of bytes within a cache block so that if a write operation has been performed on a subset of a cache block's bytes, then when the new data is being written to disk, it is possible to copy only relevant or necessary bytes to the disk.

[0265] In one embodiment, the Cache Node Table 350 is conceptually divided into three lists, as depicted in FIG. 8B. A Normal List 860 includes all the Cache Nodes 800 in the Cache Node Table 350 which are associated with cached data that is not currently in use. A Write List 865 holds the Cache Nodes 800 of data blocks that have been modified and that are waiting to be written to disk. A Read List 870 holds the Cache Nodes 800 of data blocks that are currently being read by one or more clients.

[0266] When existing cached data is needed for a write or a read operation, the associated Cache Node 800 can be “removed” from the Normal List 860 and “linked” to the Write List 865 or the Read List 870, as appropriate. The Cache Nodes 800 in each of the lists 860, 865, 870 can be linked by using the PrevPtr 815 and NextPtr 820 fields. The Cache Nodes 800 of data blocks that are being written to can be “moved” from the Normal List 860 to the Write List 865 until an associated data block stored on the disk array 140 is updated. The Cache Nodes 800 of data blocks that are being read can be similarly “moved” to the Read List 870 by resetting the links of the PrevPtr 815 and NextPtr 820 fields.

[0267] The Cache Nodes 800 of data blocks that are being read can additionally have their ReadCt field 830 incremented, so that a count may be kept of the number of clients currently reading a given data block. If additional clients simultaneously read the same file, the server 130 increments the Cache Node's ReadCt field 830 and the Cache Node 800 can stay in the Read List 870. As each client finishes reading, the ReadCt 830 is appropriately decremented. When all clients have finished reading the file block and the ReadCt field 830 has been decremented back to a starting value, such as 0, then the Cache Node 800 is returned to the Normal List 860.
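The reader-counting behavior can be illustrated with the following Python sketch. The class `CacheNode` here is a hypothetical reduction that tracks only a read count and a label naming the list the node currently belongs to; the actual relinking through the PrevPtr and NextPtr fields is not modeled.

```python
class CacheNode:
    """Hypothetical Cache Node reduced to the state this sketch needs."""

    def __init__(self):
        self.read_ct = 0              # plays the role of the ReadCt field
        self.current_list = "normal"  # "normal" or "read"


def begin_read(node):
    """A client starts reading: count it and keep the node on the Read List."""
    node.read_ct += 1
    node.current_list = "read"


def end_read(node):
    """A client finishes; the last reader returns the node to the Normal List."""
    if node.read_ct == 0:
        raise ValueError("end_read called with no active readers")
    node.read_ct -= 1
    if node.read_ct == 0:
        node.current_list = "normal"
```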

[0268] In one embodiment, the server 130 that wishes to access an existing Cache Node 800 for a read or a write operation can “take” the desired Cache Node 800 from any position in the Normal List 860, as needed. The Cache Nodes 800 from the Write List 865 whose associated data have already been written to disk are returned to a “top” position 875 of the Normal List 860. Similarly, when no clients are currently reading the cached data associated with a given Cache Node 800 on the Read List 870, the Cache Node 800 is returned to the “top” position 875 of the Normal List 860. In this way, a most recently accessed Cache Node 800 amongst the Cache Nodes 800 on the Normal List 860 will be at the “top” position 875, and a least recently accessed Cache Node 800 will be at a “bottom” position 880.

[0269] In one embodiment, if space in the cache is needed for a new data block when all of the Cache Nodes 800 have been assigned, then the Cache Node 800 in the “bottom” position 880 is selected to be replaced. To do so, the cached data associated with the “bottom” Cache Node 880 can be written to a disk location specified in the Data Gee field 810 of the “bottom” Cache Node 880, and the Data Gee 810 from the “bottom” Cache Node 880 is returned to its location in the Gee Table 320. The “bottom” Cache Node 880 can then be overwritten by data for a new data block.

[0270] In one embodiment, the server nodes 150, 151 in the cluster 160 do not have access to one another's cache memory. Therefore, unlike the metadata structures described in FIGS. 4-7, the Cache Node Table 350 is not replicated across the servers 130, 131 of the cluster 160.
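The ordering of the Normal List 860 amounts to a least-recently-used policy. As a sketch only, the following Python uses `collections.OrderedDict` as a stand-in for the doubly-linked list; the class `NormalList` and its method names are hypothetical, and writing the evicted data back to disk is left out.

```python
from collections import OrderedDict


class NormalList:
    """Hypothetical Normal List: first entry is "bottom", last is "top"."""

    def __init__(self):
        self._nodes = OrderedDict()

    def return_to_top(self, node_id, cache_node):
        """Place a node released from the Read or Write List at the top."""
        self._nodes[node_id] = cache_node
        self._nodes.move_to_end(node_id)

    def take(self, node_id):
        """Remove a node from any position for a new read or write."""
        return self._nodes.pop(node_id)

    def evict_bottom(self):
        """Select the least recently accessed node for replacement."""
        return self._nodes.popitem(last=False)
```

A linked list threaded through a flat array, as the embodiment describes, gives the same constant-time removal from any position; the ordered dictionary is used here only to keep the sketch short.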

[0271] Lock Nodes and Refresh Nodes

[0272] In addition to the metadata structures described above in connection with FIGS. 3-8, other metadata structures can be used to enhance the security and the efficiency of the file system 250. Two metadata structures, a Lock Node Table and a Refresh Node Table, assist with the management of “shares” and “locks” placed on the files of the server node 150. A share or a lock represents a client's request to limit access by other clients to a given file or a portion of a file. Depending on its settings, as will be described in greater detail below, a share or a lock prevents other client processes from obtaining or changing the file, or some portion of the file, while the share or lock is in force. When a client requests a share or a lock, it can either be granted, or, if it conflicts with a previously granted share or lock, it can be given a “pending” status until the original share or lock is completed.

[0273] Information about current shares and locks placed on a server node's files is stored in a Lock Node Table. A Lock Node Table includes Lock Strings, where each Lock String describes the current and pending shares and locks for a given file.

[0274] FIG. 9 shows the structure of one embodiment of a Lock String 900. The Lock String 900 includes five nodes 911, 912, 921, 922, and 923. The first two nodes 911 and 912 are Share Nodes 910. The next three nodes 921-923 are Lock Nodes 920. As shown in FIG. 9, in one embodiment, Share Nodes 910 precede Lock Nodes 920 in the Lock String 900.

[0275] The Share Nodes 910 have eight fields 930-937, and the Lock Nodes 920 have ten fields 930-933 and 938-943. In FIG. 9, the first four fields of both the Share Nodes 910 and the Lock Nodes 920 are the same, and as such, a description of one shall be understood to apply to both Share Nodes and Lock Nodes.

[0276] A lockStatus field 930 indicates whether the node is of type SHARE or LOCK, or if it is currently an unused FREE node. A SHARE node represents a current or pending share request. A share applies to an entire file, and, if granted, it specifies the read and write permissions for both a requesting client and for all other clients in the system. A LOCK node represents a current or pending lock request. A lock applies to a specified byte range within a file, and, if granted, it guarantees that no other client process will be able to access the same range to write, read or read/write, depending on the values in the other fields, while the lock is in effect.

[0277] A timeoutCt field 931 helps to ensure that locks and shares are not inadvertently left in effect past their intended time, due to error, failure of a requesting client process, or other reason. Locks automatically “time out” after a given length of time unless they are “refreshed” periodically.

[0278] A next field 932 points to the next node in the Lock String 900. A pending field 933 indicates whether the lock or share represented by the node is active or pending.

[0279] The fields 934-937 of FIG. 9 contain additional information useful to the Share Nodes 910. An access field 935 indicates the kind of access to the file that the client desires. In one embodiment, the access field 935 may take on one of four possible values: 0 indicates that no access to the file is required; 1 indicates that read only access is required; 2 indicates that only write access is required; and 3 indicates that read and write access to the file are both required.

[0280] A mode field 934 indicates the level of access to the file that another client process will be permitted while the share is in effect. In one embodiment, the mode field 934 can take on one of four possible values: 0 indicates that all access by other client processes is permitted; 1 indicates that access to read the file is denied to other client processes; 2 indicates that access to write to the file is denied to other client processes; and 3 indicates that both read and write access are denied to other client processes.

[0281] A clientID field 936 identifies the client that requested the share. A uid field 937 identifies the user on the client that has requested the share or lock.
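Because the access values and the mode values listed above both encode read as bit 1 and write as bit 2, a conflict between two shares can be tested with bitwise operations. The rule in the following Python sketch (a new share conflicts if it wants access that an existing share denies, or denies access that an existing share holds) is an assumption made for illustration; the text does not spell out the conflict test, and the names used are hypothetical.

```python
# Values of the access field, as listed in the text.
ACCESS_NONE, ACCESS_READ, ACCESS_WRITE, ACCESS_READ_WRITE = 0, 1, 2, 3
# Values of the mode field, as listed in the text.
DENY_NONE, DENY_READ, DENY_WRITE, DENY_READ_WRITE = 0, 1, 2, 3


def shares_conflict(existing_access, existing_mode, new_access, new_mode):
    """Assumed rule: conflict if either share wants what the other denies."""
    new_wants_denied = bool(new_access & existing_mode)
    new_denies_held = bool(existing_access & new_mode)
    return new_wants_denied or new_denies_held
```

Under this assumed rule, a share request that conflicts with a granted share would be given the “pending” status described earlier rather than being granted immediately.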

[0282] Fields 938-943 of FIG. 9 contain additional information useful to Lock Nodes 920. An offset field 938 indicates the starting point of the byte range within the file where the lock is in effect. A length field 939 indicates the length of the segment (beginning at the offset point) that is affected by the lock. In one embodiment, Lock Nodes 920 are kept ordered within the Lock String 900 according to their offset field 938.

[0283] An exclusive field 940 indicates whether the lock is exclusive or non-exclusive. An exclusive lock, sometimes called a write lock, is used to guarantee that the requesting process is the only process with access to that part of the file for either reading or writing. A non-exclusive lock, often called a read lock, is used to guarantee that no one else may write to the byte range while the requesting process is using it, although reading the file is permitted to other clients.

[0284] A clientID field 941 identifies the client that requested the lock. A uid field 942 identifies the user on the client that is requesting the lock. A svid field 943 identifies the process that is requesting the lock.
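A byte-range lock conflict can be sketched in Python as follows. The rule used here (two locks conflict when their byte ranges overlap and at least one of them is exclusive) is inferred from the description of exclusive and non-exclusive locks and should be read as an assumption; the tuple layout and function names are hypothetical.

```python
def ranges_overlap(offset_a, length_a, offset_b, length_b):
    """True when [offset, offset + length) ranges share at least one byte."""
    return offset_a < offset_b + length_b and offset_b < offset_a + length_a


def locks_conflict(lock_a, lock_b):
    """Each lock is a hypothetical (offset, length, exclusive) tuple."""
    offset_a, length_a, exclusive_a = lock_a
    offset_b, length_b, exclusive_b = lock_b
    if not ranges_overlap(offset_a, length_a, offset_b, length_b):
        return False
    # Two read (non-exclusive) locks may coexist on the same bytes.
    return exclusive_a or exclusive_b
```

Keeping Lock Nodes ordered by offset, as described above, would allow such a check to stop scanning once the offsets in the Lock String pass the end of the requested range.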

[0285] In one embodiment, a Refresh Node Table is used to detect clients who hold locks or shares on files and who are no longer in communication with the DFSS 100. A Refresh Node is created for each client that registers a lock or share. FIGS. 10 and 11 depict examples of how Refresh Nodes can be configured as a binary tree and as a doubly-linked list, respectively. Based on the task at hand and on the links used for traversal, both structures can exist simultaneously for the same set of Refresh Nodes, as will be explained in greater detail below.

[0286] Referring to FIG. 10, six Refresh Nodes 1000, 1010, 1020, 1030, 1040, and 1050 are shown configured as a binary tree. The structure of each Refresh Node is the same, and it is to be understood that a detailed description of one Refresh Node 1000 applies also to the other Refresh Nodes 1010, 1020, 1030, 1040, and 1050 of FIG. 10. In one embodiment, the Refresh Node 1000 includes six fields. A clientID field 1001 identifies a client who has registered at least one current lock or share. A counter field 1002 maintains a counter that, in one embodiment, is originally set to a given start value and is periodically decremented until a “refresh” command comes from the client to request that the counter be returned to its full original value. If the counter field 1002 is allowed to decrement to a specified minimum value before a “refresh” command is received from the identified client 1001, then all locks and shares associated with the client 1001 are considered to have “timed out,” and they are removed from their respective Lock Strings 900.

[0287] In one embodiment, Refresh Nodes are allocated from a flat array of Refresh Nodes. The Refresh Nodes can be linked and accessed in a variety of ways, depending on the task at hand, with the help of pointer fields located in each node. For example, when a “refresh” command arrives from the client 110, it is advantageous to be able to quickly locate the Refresh Node 1000 with the associated clientID field 1001 in order to reset its counter field 1002. A binary tree structure, as shown in the example of FIG. 10, can allow for efficient location of the Refresh Node 1000 with the given clientID field 1001 value if the nodes of the tree are organized based on the clientID field 1001 values. In such a case, a left link field 1003 (ltLink) and a right link field 1004 (rtLink), pointing to the Refresh Node's left and right child, respectively, provide links for traversal of the tree using conventional algorithms for traversing a binary tree.

[0288] In one embodiment, unused Refresh Nodes 1100, 1110, 1120, 1130 in the flat array are kept in a doubly-linked Free List, such as the one depicted in FIG. 11, for ease of allocation and de-allocation. In one embodiment, used Refresh Nodes are kept in a doubly-linked list, called a Used List. With this structure, decrementing the counter field 1002 of each Refresh Node that is currently in use can be carried out efficiently. In FIG. 11, a stackNext field 1105 and a stackprev field 1106 of the Refresh Node 1100 together allow for doubly-linked traversal of the Refresh Nodes of the Free List and the Used List. When a new Refresh Node is needed, it can be removed from the Free List and linked to both the Used List and the binary tree by the appropriate setting of the link fields 1003, 1004, 1105, and 1106.
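The countdown-and-refresh behavior of the counter field can be sketched in Python as follows. A dictionary keyed by client ID stands in for both the binary tree (lookup by clientID) and the Used List (iteration over nodes in use); the class `RefreshTable`, the constants `START_VALUE` and `MIN_VALUE`, and their values are hypothetical.

```python
START_VALUE = 3  # hypothetical start value for the counter field
MIN_VALUE = 0    # hypothetical minimum value at which a client times out


class RefreshTable:
    """Hypothetical table of per-client refresh counters."""

    def __init__(self):
        self.counters = {}  # client id -> remaining count

    def register(self, client_id):
        """Create a counter when a client first registers a lock or share."""
        self.counters.setdefault(client_id, START_VALUE)

    def refresh(self, client_id):
        """A "refresh" command returns the counter to its start value."""
        if client_id in self.counters:
            self.counters[client_id] = START_VALUE

    def tick(self):
        """Periodic decrement; returns the clients that have timed out."""
        timed_out = []
        for client_id in list(self.counters):
            self.counters[client_id] -= 1
            if self.counters[client_id] <= MIN_VALUE:
                timed_out.append(client_id)
                del self.counters[client_id]
        return timed_out
```

In a fuller model, the list returned by `tick` would drive removal of the timed-out clients' locks and shares from their Lock Strings.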

[0289] Intent Log

[0290] In one embodiment, the Filename Table 310, the G-node Table 330, the Gee Table 320 and the Gnid Table 340 are cached as well as being stored on the disk array 140. In one embodiment, when the server 130 changes a portion of the metadata in cache, an entry is made into an Intent Log in non-volatile memory, such as flash memory or battery-backed RAM. The Intent Log Entry documents the intention to update both the version of the metadata stored on the disk array 140 and any mirrored version(s) of the metadata on other server nodes 151 of the cluster 160. The Intent Log provides protection against inconsistencies resulting from a power loss before or during an update.

[0291] The following is a list of steps that show the general use of the Intent Log:

[0292] 1. Cached metadata is updated at the time of the original change.

[0293] 2. An intention to update the disk version of the metadata is put into the Intent Log.

[0294] 3. A copy of the intention is transmitted to other server nodes of the cluster.

[0295] 4. The intention to write metadata to disk on the first server node is executed.

[0296] 5. The intention to write metadata to disk on the other server nodes is executed.

[0297] 6. The Intent Log Entry on the first server is deleted.

[0298] 7. Notice of the first server's Intent Log Entry is sent to the other server nodes.
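The seven steps above can be walked through in a simplified Python sketch. The class `ServerNode` and the functions `update_metadata` and `replay_intent_log` are hypothetical; network transmission is reduced to direct assignment, the handling of step 7 (peers discarding their copy of the entry on notice) is an assumption, and the replay routine is one assumed way the log could be used after a power loss rather than a procedure given in the text.

```python
class ServerNode:
    """Hypothetical server node with cache, disk, and non-volatile log."""

    def __init__(self, name):
        self.name = name
        self.cache = {}       # cached metadata
        self.disk = {}        # metadata stored on the disk array
        self.intent_log = {}  # entry id -> (key, value), models NVRAM


def update_metadata(first, others, entry_id, key, value):
    first.cache[key] = value                   # step 1: update cached copy
    first.intent_log[entry_id] = (key, value)  # step 2: log the intention
    for node in others:                        # step 3: send copy to peers
        node.intent_log[entry_id] = (key, value)
    first.disk[key] = value                    # step 4: write on first node
    for node in others:                        # step 5: write on other nodes
        node.disk[key] = value
    del first.intent_log[entry_id]             # step 6: delete first's entry
    for node in others:                        # step 7: notify peers (assumed
        node.intent_log.pop(entry_id, None)    # to discard their copy)


def replay_intent_log(node):
    """Assumed recovery: re-apply intentions that survived a power loss."""
    for entry_id, (key, value) in list(node.intent_log.items()):
        node.disk[key] = value
        del node.intent_log[entry_id]
```

Because an entry remains in non-volatile memory until its disk write has completed, an interrupted update in this sketch leaves enough information behind to finish the write on restart.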

[0299] FIG. 12 shows the structure of an Intent Log Entry 1200. In one embodiment, the Entry 1200 includes seven fields. A status field 1210 designates whether the intention is FREE, WAITING, or ACTIVE. An intentType field 1220 designates the type of metadata that is to be updated. For example, the update may apply to a G-node, a Gnid, a Gee, a Filename Entry, or to a file's last access time (aTime). A goalBufferIndex field 1230 points to an entry in a Goal Buffer that is used to verify the update. Field 1240 is a spare field that helps align the fields to a 64-bit boundary. A driveSector field 1250 and a drive field 1260 identify the location on disk where the update is to be made. An intentData field 1270 holds the data of the update.

[0300] File Handle

[0301] A file handle is provided to clients by the DFSS 100 for use when requesting access to a file. Each file handle uniquely identifies one file. The DFSS 100 treats both normal data files and directories as files, and provides file handles for both. In the description that follows, the term “file” may apply to either a data file or a directory file, unless specifically limited in the text.

[0302] FIG. 13 shows the structure of one embodiment of a file handle 1300 as a 32-bit number with three fields. A Recommended NIC field 1310 indicates which of a server's Network Interface Connections (NICs) is recommended for accessing the file associated with the file handle 1300. Fibre Channel typically provides two ports per server; accordingly, in one embodiment, the Recommended NIC field 1310 is one bit in size.

[0303] A ServerID field 1320 identifies, by means of a server identification number (ServerID), the primary owner of the associated file. The inclusion of the file owner's ServerID 1320 in the file handle 1300 enables a user on the client 110 to access a file in the distributed file system 250 without needing to know explicitly which server node is holding the desired file. Using the file handle 1300 to request a file from the file system software 250 allows the file system software 250 to direct the request to the appropriate server. By contrast, conventional UNIX file handles do not include information regarding the server storing a file, and they are therefore not able to accommodate the level of transparent file access provided in the file system software 250.

[0304] In one embodiment, clusters 160 include only two server nodes 150, 151, and the ServerID of the file's secondary owner can be obtained by “flipping” the least significant bit of the field 1320. This ability is useful when the primary owner 150 is very busy and must issue a “retry later” response to a client's request to read a file. In return, the client 110 can temporarily change the ServerID in the file's file handle 1300 and re-send the read request to the file's secondary owner 151. Similar accommodations can be made for clusters of more than two server nodes.

[0305] A G-node Index field 1330 provides an index into the file's G-node in the G-node Table 330 on the server identified in the ServerID field 1320.

[0306] In one embodiment, the file handle for a given file does not change unless the file is moved to another server node or unless its G-node location is changed. Thus, the file handle is relatively persistent over time, and clients can advantageously store the file handles of previously accessed files for use in subsequent accesses.
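A 32-bit file handle with these three fields can be packed and unpacked as in the following Python sketch. Only the one-bit width of the Recommended NIC field comes from the text; the 7-bit ServerID width, the 24-bit G-node Index width, the ordering of the fields within the word, and all function names are assumptions chosen so that the widths sum to 32.

```python
# Only NIC_BITS is taken from the text; the other widths are hypothetical.
NIC_BITS, SERVER_BITS, GNODE_BITS = 1, 7, 24


def pack_handle(nic, server_id, gnode_index):
    """Combine the three fields into one 32-bit integer (assumed layout)."""
    if not 0 <= nic < (1 << NIC_BITS):
        raise ValueError("nic out of range")
    if not 0 <= server_id < (1 << SERVER_BITS):
        raise ValueError("server_id out of range")
    if not 0 <= gnode_index < (1 << GNODE_BITS):
        raise ValueError("gnode_index out of range")
    return ((nic << (SERVER_BITS + GNODE_BITS))
            | (server_id << GNODE_BITS)
            | gnode_index)


def unpack_handle(handle):
    """Split a handle back into (nic, server_id, gnode_index)."""
    gnode_index = handle & ((1 << GNODE_BITS) - 1)
    server_id = (handle >> GNODE_BITS) & ((1 << SERVER_BITS) - 1)
    nic = handle >> (SERVER_BITS + GNODE_BITS)
    return nic, server_id, gnode_index


def secondary_owner_handle(handle):
    """Flip the least significant ServerID bit (two-node cluster case)."""
    return handle ^ (1 << GNODE_BITS)
```

With this layout, a client that receives a “retry later” response could derive the secondary owner's handle with a single exclusive-or and restore the original handle by applying the same operation again.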

[0307] File Handle Look-Up

[0308] In order to access a desired file, the client 110 sends the file's file handle 1300 and a request for file access to the file system 250. As was illustrated in the embodiment shown in FIG. 13, the file handle 1300 of a given file comprises information to identify the server that stores the file and the location of the file's G-node 600 in the G-node Table 330. With the information found in the G-node 600, as described in the example of FIG. 6, the desired file can be located and accessed.

[0309] The file handle 1300 for a given file remains relatively static over time, and, typically, the client 110 stores the file handles 1300 of files that it has already accessed for use in subsequent access requests. If the client 110 does not have a desired file's file handle 1300, the client 110 can request a file handle look-up from the file system 250 to determine the needed file handle 1300.

[0310] In one embodiment of a file handle look-up process, the DFSS 100 accepts the file handle 1300 of a parent directory along with the filename of a desired child file, and the DFSS 100 returns the file handle 1300 for the desired child file. If the client 110 does not know the file handle 1300 for the desired file's parent directory, then the client 110 can use the file handle 1300 for any directory along the pathname of the desired file and can request a file handle look-up for the next component on the desired pathname. The client 110 can then iteratively request a file handle look-up for each next component of the pathname, until the desired file's file handle 1300 is returned.

[0311] For example, if the client 110 desires the file handle 1300 for a file whose pathname is “root/WorkFiles/PatentApps/DesiredFile” and if the client 110 has the file handle 1300 for the parent “PatentApps” directory, then the client 110 can send the look-up request with the “PatentApps” file handle 1300 to get the “DesiredFile” file handle 1300. If the client initially has no file handle 1300 for the parent “PatentApps” directory, but does have the file handle 1300 for the “WorkFiles” directory, then the client 110 can send a first look-up request with the known “WorkFiles” file handle 1300 together with the filename for the “PatentApps” directory. The DFSS 100 returns the file handle for the “PatentApps” directory. Since the client 110 still does not have the needed “DesiredFile” file handle 1300, the client 110 can send a second file handle look-up request, this time using the newly received “PatentApps” file handle and the “DesiredFile” filename. In response, the file system 250 returns the “DesiredFile” file handle 1300. In this way, beginning with the file handle 1300 for any file along the pathname of a desired file, the file handle 1300 for the desired file can eventually be ascertained.
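The iterative look-up in this example can be sketched in Python as follows. The dictionary `DIRECTORIES`, the string handles such as "h-work", and the functions `lookup` and `resolve` are hypothetical stand-ins; a real look-up request would be sent to the file system rather than answered from a local table.

```python
# Hypothetical directory contents: parent handle -> {child name: child handle}
DIRECTORIES = {
    "h-root": {"WorkFiles": "h-work"},
    "h-work": {"PatentApps": "h-patent"},
    "h-patent": {"DesiredFile": "h-desired"},
}


def lookup(parent_handle, filename):
    """Stand-in for one file handle look-up request."""
    return DIRECTORIES[parent_handle][filename]


def resolve(known_handle, remaining_components):
    """Issue one look-up per remaining pathname component.

    Returns (final_handle, number_of_requests_sent).
    """
    handle = known_handle
    requests = 0
    for name in remaining_components:
        handle = lookup(handle, name)
        requests += 1
    return handle, requests
```

Starting from the “WorkFiles” handle takes two requests in this sketch, matching the two look-ups in the example, while starting from the “PatentApps” handle takes one.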

[0312] In one embodiment, when the client 110 first accesses the filesystem 250, the client 110 is provided with one file handle 1300, namelythe file handle for a “root” directory. The “root” directory is thedirectory that contains all other directories, and is therefore thefirst component on the pathname of every file in the system. Thus, ifneed be, the client 110 can begin the look-up process for any file'sfile handle 1300 with a look-up request that comprises the “root” filehandle and the filename of the next component of the desired file'spathname. The final file handle returned will provide the client withthe information needed to accurately locate the desired file.

[0313] FIG. 14A shows an example of the file handle look-up procedure in which the client 110 has a file handle 1300 for a desired file's parent directory and needs a file handle for the desired file itself. The client 110 initiates a look-up for the desired file handle by sending a look-up request 1410 that comprises a filename 1420 of the desired file and the file handle 1300 of the parent directory. The ServerID field 1320 in the file handle 1300 identifies the server 130 of the node 150 where the parent directory is stored, and the file system software 250 directs the look-up request 1410 to the identified server 130. The G-node index field 1330 stores an index for the parent directory's G-node in the G-node Table 330 on the identified server.

[0314] In this example, the filename 1420 of the desired file is “AAAAA.” The ServerID field 1320 indicates that the parent directory is stored on the server 130 with ServerID “123,” and the G-node index field 1330 shows that a G-node for the parent directory can be found at index location “1” in the G-node Table 330.

[0315] When the server 130 receives the look-up request 1410, the server 130 uses information in the G-node index field 1330 of the file handle 1300 to access a G-node 1432 at index location “1.”

[0316] As described above, the G-node 600 acts as a repository of general information regarding a file. In the example illustrated in FIG. 14A, the File Attribute-type field 602 of the G-node 1432, namely “NFDIR,” indicates that the file associated with the G-node 1432 is a directory, not a regular data file.

[0317] As described earlier, the Gnid-string 700 holds information about the children files of a given directory. The Child Gnid Index 628 in G-node 1432 points to a first Gnid 1436 in the directory's Gnid-string 700. The server 130 searches for the desired data file amongst the children files of the parent directory by searching the corresponding Gnids on the directory's Gnid-string. The server 130 uses the Filename Ptr fields 760 of each Gnid 710 to access the associated file's filename entry 410 for comparison with the filename 1420 of the desired file.

[0318] In FIG. 14A, the Child Gnid Index field 628 of G-node 1432 indicates a value of “3,” and the server 130 accesses the Gnid 1436 at index location “3” in the Gnid Table 340. To determine a filename associated with the Gnid 1436, the server 130 uses the Filename Ptr field 760 to access the Filename Entry 1438 associated with the Gnid 1436 at index “3.” To ascertain if the filename stored at the Filename Entry 1438 matches the filename 1420 in the look-up request 1410, the server 130 first compares the checksum and filename length count of the filename 1420 in the look-up request 1410 with the checksum 412 and the filename length count 413 stored in the Filename Entry 1438 in the Filename Table 310. (Note: These checksums and filename lengths are not shown explicitly in FIGS. 14A and 14B.) If the aforementioned checksums and filename length counts match, the server 130 proceeds with a character-by-character comparison of the character string 1420 in the look-up request 1410 and the filename 414 in the Filename Entry 1438.

[0319] If a mismatch is encountered during the comparisons, as is the case in FIG. 14A, where the Filename Entry 1438 holds a filename of “ABCD” and length “4” while the desired filename of “AAAAA” has a length of “5,” then the current Gnid is eliminated from consideration. After encountering a mismatch for the Gnid 1436 at index “3,” the server 130 continues to traverse the Gnid-string 700 by using the Sibling Gnid Ptr field 740 in the current Gnid 1436 as an index pointer.

[0320] The Sibling Gnid Ptr field 740 of the Gnid 1436 holds a value of “4,” indicating that a next Gnid 1440 can be found at index location “4” of the Gnid Table 340. When the checksum and name length for the desired filename 1420 do not match those from a Filename Entry 1442 “DE” found at index location “0” of the Filename Table 310, the server 130 again eliminates the current Gnid from consideration.

[0321] The server 130 again uses the Sibling Gnid Ptr field 740 as a pointer, this time from the Gnid 1440 at index location “4” to a Gnid 1444 at index location “6” in the Gnid Table 340. Following the Filename Ptr 760 of the Gnid 1444 to Filename Entry 1446 and performing the aforementioned checksum, filename length, and filename comparisons reveals that the desired filename 1420 and Filename Entry filename 1446 do match. The server 130 therefore determines that this Gnid 1444 is associated with the desired file.

[0322] In order to send the desired file handle 1300, which comprises the ServerID 1320 and G-node Table index 1330 for the desired file, to the requesting client 110, the server 130 accesses the G-node Ptr field 750 of the current Gnid 1444. The G-node 600 of a file is stored on the server node 150 where the file is stored, which is not necessarily the same server node that holds its parent directory. The G-node Ptr field 750 provides both the ServerID of the server that is the file's primary owner and an index that identifies the file's G-node 1448 in the primary owner's G-node Table 330.

[0323] In the example of FIG. 14A, the contents of the G-node Ptr field 750 show that the desired G-node 1448 exists at location “9” in the G-node table 330 on the same server 130, namely the server with ServerID “123.” However, it would also be possible for the G-node Ptr field 750 to contain an index to a G-node Table 330 on another server 132, in which case, the file handle 1300 would include the ServerID of the server 132 holding the file and its G-node 600. (This possibility is indicated by the dotted arrow 1460 pointing from the G-node Ptr field 750 to another server 132 of the DFSS 100.) Thus, the information in the G-node Ptr field 750 allows the server 130 to provide the client 110 with both a ServerID 1320 and with the G-node Index 1330 needed to create the file handle 1300 for the desired file. The file handle 1300 for the desired file can be sent back to the client 110 for use in future access of the desired file, and the process of file handle look-up is complete.
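
The traversal of FIG. 14A can be sketched, for illustration only, in Python. The Gnid index values “3,” “4,” and “6,” the filenames “ABCD,” “DE,” and “AAAAA,” the Filename Table index “0” for “DE,” and the G-node pointer to index “9” on ServerID “123” are taken from the example above. Everything else is an assumption made for the sketch: the `checksum` function (a simple byte sum), the remaining Filename Table indices, the G-node pointers of the non-matching Gnids, and the Python names themselves.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


def checksum(name: str) -> int:
    # Placeholder checksum (assumption): byte sum truncated to 16 bits.
    return sum(name.encode("utf-8")) & 0xFFFF


@dataclass
class FilenameEntry:
    checksum: int  # checksum 412
    length: int    # filename length count 413
    name: str      # filename 414


@dataclass
class Gnid:
    sibling_ptr: Optional[int]  # Sibling Gnid Ptr 740 (None ends the string)
    gnode_ptr: Tuple[int, int]  # G-node Ptr 750: (ServerID, G-node index)
    filename_ptr: int           # Filename Ptr 760


def entry(name: str) -> FilenameEntry:
    return FilenameEntry(checksum(name), len(name), name)


# Filename Table indices 5 and 2 are hypothetical; index 0 follows the text.
FILENAME_TABLE: Dict[int, FilenameEntry] = {
    5: entry("ABCD"), 0: entry("DE"), 2: entry("AAAAA"),
}
GNID_TABLE: Dict[int, Gnid] = {
    3: Gnid(sibling_ptr=4, gnode_ptr=(123, 7), filename_ptr=5),
    4: Gnid(sibling_ptr=6, gnode_ptr=(123, 8), filename_ptr=0),
    6: Gnid(sibling_ptr=None, gnode_ptr=(123, 9), filename_ptr=2),
}


def find_child(first_gnid: int, wanted: str) -> Optional[Tuple[int, int]]:
    """Walk a Gnid-string; return the matching child's file handle fields."""
    want_sum, want_len = checksum(wanted), len(wanted)
    index: Optional[int] = first_gnid
    while index is not None:
        gnid = GNID_TABLE[index]
        e = FILENAME_TABLE[gnid.filename_ptr]
        # Cheap comparisons first; compare characters only if both match.
        if (e.checksum, e.length) == (want_sum, want_len) and e.name == wanted:
            return gnid.gnode_ptr
        index = gnid.sibling_ptr
    return None
```

Calling `find_child(3, "AAAAA")` visits the Gnids at index locations 3, 4, and 6 in turn and returns `(123, 9)`, which corresponds to the ServerID and G-node index needed to build the file handle.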

[0324] FIG. 14B shows one example of a file access operation, illustrated using the same context as was used in FIG. 14A. Here, the client 110 already has a file handle 1301 for the desired file, so an access request 1411 can be sent directly to the file system 250. As previously disclosed, the user on the client 110 has no need to be aware of the specific server node 150 that will be accessed. This information is embedded in the desired file's file handle 1301.

[0325] The server 130, indicated in a ServerID field 1321, accesses the G-node 1448 at index “9” as indicated in a G-node index field 1331 of the file handle 1301.

[0326] As disclosed above, the Gee Table 320 holds information about the physical storage locations of a file's data and parity blocks on the disk array 140. The Gee Table 320 also holds information that helps locate blocks of data that have been copied to cache. A Gee holds storage location information about one block of data. Gees for a given file are linked together to form the gee-string 500. A first Gee of the gee-string 500 is called the root of the gee-string 500.

[0327] The Gee Index-Root field 636 of the G-node 1448 provides an index to a root Gee 1450 in the Gee Table 320. Reading the data field 591 of the Gee 1450 confirms that this Gee is a root Gee and that it is associated with the G-node 1448 at index location “9.” The server 130 continues reading the gee-string at the next contiguous Gee 1452 in the Gee Table 320. Reading the G-code 590 of the Gee 1452 with its value of “CACHE DATA” reveals that this Gee represents data that has been cached.

[0328] As disclosed above, the Cache Node Table 350 holds information that allows the server 130 to access a file block's location in cache 1456. Reading the Data Field 591 of a next Gee 1452 provides a pointer to an appropriate cache node 1454 of the Cache Node Table 350. The cache node 1454 holds the CacheBlockAddr field 825 which points to a location 1458 in cache 1456 of the data associated with the Gee 1452. The cache node 1454 also holds a copy of the associated Gee 1452 from the Gee Table 320 in the Data Gee field 810 until the associated data block 1458 is no longer stored in cache. The Data Gee field 810 also provides a pointer to the location of the associated file data stored on the server node's disk array 140. By following the pointers from the file handle 1301 to the G-node 1448 at index location “9”, on to the Gees 1450 and 1452 at index locations “2” and “3,” on to the Cache Node 1454 at index location “7,” and finally on to cache location “w” 1458, the data originally requested by the client 110 can be accessed for reading, writing, or other operations, and the process of file access is complete.
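
The pointer chain of FIG. 14B can be sketched, for illustration only, in Python. The index locations “9,” “2,” “3,” and “7” and the cache location “w” follow the example above. The Python names, the “ROOT” G-code label, the disk address stored in the Data Gee field, and the contents of the cached block are assumptions made for the sketch, and routing of the request to the server named in the ServerID field is omitted.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Gee:
    g_code: str  # G-code 590
    data: int    # data field 591


@dataclass
class CacheNode:
    data_gee: Gee          # Data Gee field 810
    cache_block_addr: str  # CacheBlockAddr field 825


# G-node Table 330: only the Gee Index-Root field 636 is modeled here.
GNODE_TABLE: Dict[int, Dict[str, int]] = {9: {"gee_index_root": 2}}

# Gee Table 320: a root Gee followed by one cached data Gee.
GEE_TABLE: Dict[int, Gee] = {
    2: Gee("ROOT", 9),        # "ROOT" label is assumed; data names G-node 9
    3: Gee("CACHE DATA", 7),  # data field points to cache node 7
}

# Cache Node Table 350 and the cache itself; 0x1000 is a made-up disk address.
CACHE_NODE_TABLE: Dict[int, CacheNode] = {7: CacheNode(Gee("DATA", 0x1000), "w")}
CACHE: Dict[str, bytes] = {"w": b"file block contents"}


def read_first_block(handle: Tuple[int, int]) -> bytes:
    """Follow file handle -> G-node -> Gees -> cache node -> cached block."""
    _server_id, gnode_index = handle
    root_index = GNODE_TABLE[gnode_index]["gee_index_root"]
    root = GEE_TABLE[root_index]
    if root.data != gnode_index:
        raise ValueError("root Gee does not refer back to the G-node")
    gee = GEE_TABLE[root_index + 1]  # next contiguous Gee in the table
    if gee.g_code not in ("CACHE DATA", "CACHE PARITY"):
        raise LookupError("block is not cached")
    node = CACHE_NODE_TABLE[gee.data]
    return CACHE[node.cache_block_addr]
```

With these tables, `read_first_block((123, 9))` follows the same path as the figure and returns the bytes held at cache location “w.”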

[0329] FIGS. 15-17 present a set of interrelated flow charts that illustrate the process of file access, including file handle look-up, if necessary.

[0330] Referring to FIG. 15, a process 1500 of accessing a file is described, beginning with the request for a file handle look-up, through the use of the file system's metadata structures, to final access of the file data in cache.

[0331] Beginning at a start state 1505, the process 1500 moves to a state 1510 where the client 110 determines whether it has the file handle 1300 for a file that it wishes to access.

[0332] If the client 110 does not have the desired file handle 1300, the process 1500 moves to a state 1515, where the client 110 and one or more servers of the DFSS 100 perform a file handle look-up, as will be described in greater detail with reference to FIG. 16.

[0333] Returning to the state 1510, if the client 110 determines that it does have the desired file handle 1300, then the process 1500 moves on to a state 1520 where the client 110 sends the file access request 1411 to the server 130 indicated in the file handle 1300.

[0334] From state 1520, the process 1500 moves to a state 1525 where the server 130 accesses a G-node 600 indicated in the file handle 1300.

[0335] Moving on to a state 1530, the server 130 uses a pointer in the G-node 600 to access an appropriate Gee in the Gee Table 320. Several possibilities exist for appropriate gees, depending on the current access needs of the server 130. For example, in the embodiment of the G-node 600 described in FIG. 6, seven fields 630-636 relate to pointers to the Gee Table 320. The Gee Index-Root field 636 is an index to the root Gee, which can be used, for example, when reading from the beginning of a file is desired. Fields 634 and 635 together point to the last Gee of a file, which can be used, for example, when appending new data to the end of a file. Fields 630 and 631 together point to a most recently used Gee for the file, which can be used, for example, for sequential access to the gees of a file. Fields 632 and 633 together point to a middle Gee for the gee-string 500 which can be used, for example, when access to the middle, or second half, of the file is desired.

[0336] After accessing an appropriate Gee in the state 1530, the process 1500 moves on to a state 1535 where the server 130 reads the G-code field 590 in order to determine if the data represented by the Gee has already been cached. If the G-code 590 holds a value other than “CACHE DATA” or “CACHE PARITY,” the server 130 assumes that the desired data has not yet been cached, and the process 1500 moves to a state 1540 where the desired data is sent to cache. The state 1540 is described in greater detail in connection with FIG. 17 below.

[0337] Returning to the state 1535, if the server 130 determines that the G-code 590 holds a value of “CACHE DATA” or “CACHE PARITY,” the server 130 assumes that the desired data has already been cached. The process 1500 then moves on to a state 1545 where the server 130 accesses the cache node 800 indicated in the gee's data field 591.

[0338] From the state 1545, the process 1500 moves on to a state 1550 where the server 130 manipulates the accessed cache node 800 as needed according to the description of FIG. 8B. For example, if the cache node 800 is currently on the Normal List 860, and the client 110 has requested to read the data block, the server 130 can increment the cache node's ReadCt field 830 and move it to the Read List 870.

[0339] Once the Cache Node 800 is properly updated, the process 1500 moves from the state 1550 to a state 1555 where the server 130 accesses the file block data in the cache location indicated in the Cache Node 800. From here, the process 1500 moves on to a state 1560 where the server 130 performs a desired operation on the cached data block. From the state 1560, the process 1500 moves on to a state 1570 where accessing the file is complete.

[0340] In FIG. 15, the process 1500 reaches the state 1515 only if the client 110 does not have a file handle 1300 for the desired file. Referring to the embodiment of the file handle 1300 illustrated in FIG. 13, the file handle 1300 for a given file comprises, among other possible fields, a ServerID field 1320 identifying the server 130 that stores the data and metadata for a file, as well as a G-node Index field 1330 that indicates the G-node 600 of the given file on that identified server 130.

[0341] FIG. 16 is a flow chart that describes in more detail how the process of the state 1515 carries out a file handle look-up. The look-up process 1515 begins with a look-up request that comprises the file handle 1300 for a directory on the pathname of the desired file and continues on through each component of the pathname, retrieving a file handle for each, until a file handle for the desired file itself is returned to the client 110.

[0342] The “root” directory is the first component of the pathname for every file in the file system, and, if necessary, the client 110 can begin the process of file handle look-up 1515 with the file handle of the “root” directory. In one embodiment, every client has at least the file handle 1300 for a “root” directory for the file system 250. For example, the “root” directory can be known to reside on the server 130 with ServerID number 0, and its G-node 600 can be known to reside at index 0 of the G-node Table 330 on Server 0. However, it may also be that at the beginning of the look-up process 1515, the client 110 has the file handle 1300 for the desired file's parent directory or for another directory on the pathname of the file, and that by beginning with one of these directories “closer” to the file itself, the look-up process may be shortened.

[0343] Beginning at a start state 1605, the process 1515 moves to a state 1610 where the client 110 sends the look-up request 1410 comprising the file handle 1300 for a directory and the filename 1420 of a desired next component. The look-up request 1410 is sent to a server 130 indicated in the file handle 1300 field of the look-up request 1410. The process 1515 next moves to a state 1615, where the server 130 accesses a G-node 600 indicated in the file handle 1300 of the look-up request 1410.

[0344] Moving on to a state 1620, the server 130 uses the ChildGnidIndex field 628 in the G-node 600 to access a first Gnid 710 in the directory's Gnid-string 700. As described in connection with the embodiment shown in FIG. 7, the Gnid-string 700 is a linked list of Gnids 710, with one Gnid 710 for each child file in a parent directory.

[0345] Moving on to a state 1625, the server 130 calculates a checksum and filename length for the filename 1420 of the next desired pathname component that was sent by the client 110 in the look-up request 1410. Having a checksum and filename length for a desired file allows the server 130 to expedite searching for a matching Filename Entry because comparison of checksums and comparison of filename lengths can be accomplished much more quickly than a character-by-character comparison of the filenames themselves. Performing the first two types of comparisons before embarking on the character-by-character comparison allows the server 130 to eliminate any Filename Entries whose checksum and filename length do not match, before performing the more costly character-by-character filename comparison.

[0346] Moving on to a state 1630, the server 130 uses the FilenamePtr field 760 of the currently accessed Gnid 710 to locate the associated Filename Entry 410 in the Filename Table 310. Moving on to a state 1635, the server 130 determines if the checksum 412 stored in the currently accessed Filename Entry 410 is greater than the checksum calculated in the state 1625.

[0347] As described in connection with FIG. 7, in one embodiment, Gnids 710 are stored in the Gnid-string 700 in order of checksum 412 values calculated for their associated character strings 414, with the Gnid 710 having the smallest checksum 412 value coming first. This ordering of Gnids 710 by checksum 412 value allows the server 130 to determine whether a desired filename may still exist on the given Gnid-string 700. In this embodiment, if, in the state 1635, the server 130 determines that the checksum 412 found in the currently accessed Filename Entry 410 is greater than the checksum calculated in the state 1625, then a Gnid 710 for the desired file (with the lower checksum) cannot exist on the currently accessed Gnid-string 700. In this case, the process 1515 moves on to a state 1640, where it reports a File-Not-Found Error to the client 110.

[0348] Returning to the state 1635, if the server 130 determines that the checksum found in the currently accessed Filename Entry is not greater than the checksum calculated in the state 1625, then the process 1515 moves on to a state 1645.

[0349] In the state 1645, the server 130 determines if the checksums and the filename lengths from the two sources match. If either the checksums or the filename lengths (or both) do not match, then this Filename Entry can be ascertained not to be associated with the client's desired file, and the process 1515 moves on to a state 1660. In the state 1660, the server 130 uses the SiblingGnidPtr 740 in the current Gnid 710 to access the next Gnid in the current Gnid-string.

[0350] Returning to the state 1645, if the server 130 determines that the checksums and filename lengths do match, then this Filename Entry 410 cannot yet be eliminated, and the process 1515 moves on to a state 1650, where the server 130 performs a character-by-character comparison of the two filenames.

[0351] If, in the state 1650, the server 130 determines that the two filenames do not match, then, as was the case in state 1645, this Filename Entry can be ascertained not to be associated with the client's desired file. In this case, the process 1515 moves on to a state 1660, where the server 130 uses a SiblingGnidPtr 740 in the current Gnid to access a next Gnid 711 in the current Gnid-string 700.

[0352] From the state 1660, the process 1515 returns to the state 1630, and the server 130 uses the Filename Ptr field 760 of the newly accessed Gnid 711 to access an associated Filename Entry in the Filename Table 310. This loop through the states 1630, 1635, 1645, 1660 (and possibly 1650) continues until a Filename Entry and associated Gnid for the desired file are found or until an error is encountered.

[0353] If, in the state 1650, the server 130 determines that the filenames do match, then the process 1515 has identified a Filename Entry and an associated Gnid that corresponds to the desired file. In this case, the process 1515 moves on to a state 1655, where the server 130 sends the desired file handle 1300 information back to the client 110. Moving on to a state 1665, the file handle look-up process 1515 is complete. The process 1500 from FIG. 15 then proceeds from the state 1515 back to the state 1510 and continues as described in the explanation of FIG. 15.
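
The search loop of FIG. 16, including the early File-Not-Found exit made possible by checksum ordering, can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the Gnid-string is modeled as a plain list of filenames already sorted by ascending checksum, the `checksum` function is a placeholder byte sum, and reaching the end of the list is treated as File-Not-Found, a case the description above does not spell out.

```python
from typing import List


def checksum(name: str) -> int:
    # Placeholder checksum (assumption): byte sum truncated to 16 bits.
    return sum(name.encode("utf-8")) & 0xFFFF


def search_ordered(children: List[str], wanted: str) -> int:
    """Return the position of `wanted` in a checksum-ordered child list."""
    want_sum, want_len = checksum(wanted), len(wanted)      # state 1625
    for position, name in enumerate(children):              # states 1630, 1660
        entry_sum = checksum(name)
        if entry_sum > want_sum:                             # state 1635
            raise FileNotFoundError(wanted)                  # state 1640
        if entry_sum == want_sum and len(name) == want_len:  # state 1645
            if name == wanted:                               # state 1650
                return position                              # state 1655
    # Assumed behavior: the string ended without a match.
    raise FileNotFoundError(wanted)
```

With `children = sorted(["ABCD", "DE", "AAAAA"], key=checksum)`, a search for “ED” passes the checksum and length tests against “DE,” fails the character-by-character comparison, and then stops at “ABCD” because that entry's checksum already exceeds the target, so the rest of the list is never examined.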

[0354] FIG. 17 presents a more detailed description of the state 1540 from FIG. 15, in which uncached data that has been requested for access by the client 110 is copied into cache memory. The process 1540 of caching file data begins in a start state 1705 and proceeds from there to a state 1710, where the server 130 identifies the least recently used cache node 880. In one embodiment of the file system 250, when the three-list scheme described in FIG. 8B is used, the server 130 can easily identify the least recently used cache node 880 because it is a “last” cache node on the Normal List 860 of the scheme.

[0355] Moving on to a state 1720, the server 130 writes the associated file data from its volatile location in cache to its non-volatile location on disk array 140, which is indicated in the DataGee field 810 of the cache node 800.

[0356] Moving on to a state 1730, the server 130 copies the DataGee field 810 from the cache node 800 back to its original position in the Gee Table 320, changing the G-code 590 back from “CACHE DATA” to “DATA” or from “CACHE PARITY” to “PARITY,” indicating that the associated data is no longer cached.

[0357] Moving on to a state 1740, the server 130 overwrites the DataGee field 810 in the cache node 800 with a Gee from the Gee Table 320 that is associated with a new file block to be cached.

[0358] Moving on to a state 1750, the server 130 caches the new file block from disk to a cache location associated with the cache node.

[0359] Moving on to a state 1760, the process 1540 of caching file data is complete, and the process 1500 in FIG. 15 can proceed from the state 1540 on to the state 1545 to continue the task of accessing a file.
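
The caching steps of FIG. 17 can be sketched, for illustration only, in Python. The sketch makes several simplifying assumptions that are not part of the description above: the Normal List is a Python list whose last element is the least recently used cache node, the DataGee field is stored in the cache node in its uncached form, the cache node records the Gee Table position it came from in a hypothetical `gee_index` attribute, and the newly cached Gee is marked by prefixing its G-code with “CACHE ” and pointing its data field at the cache node.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Gee:
    g_code: str  # G-code 590, e.g. "DATA" or "CACHE DATA"
    data: int    # data field 591: disk address, or cache node index if cached


@dataclass
class CacheNode:
    data_gee: Gee   # DataGee field 810 (kept here in its uncached form)
    gee_index: int  # hypothetical: the Gee's position in the Gee Table
    block: bytes    # cached contents of the file block


def cache_block(gee_table: Dict[int, Gee], disk: Dict[int, bytes],
                normal_list: List[int], cache_nodes: Dict[int, CacheNode],
                new_gee_index: int) -> int:
    """Reuse the least recently used cache node for a new file block."""
    node_id = normal_list.pop()                        # state 1710
    node = cache_nodes[node_id]
    disk[node.data_gee.data] = node.block              # state 1720: write back
    gee_table[node.gee_index] = node.data_gee          # state 1730: restore Gee
    new_gee = gee_table[new_gee_index]
    node.data_gee = Gee(new_gee.g_code, new_gee.data)  # state 1740
    node.gee_index = new_gee_index
    node.block = disk[new_gee.data]                    # state 1750: disk -> cache
    # Assumed bookkeeping: mark the table entry as cached and requeue the node.
    gee_table[new_gee_index] = Gee("CACHE " + new_gee.g_code, node_id)
    normal_list.insert(0, node_id)
    return node_id
```

After the call, the evicted block's Gee again reads “DATA” with its disk address, while the newly cached block's Gee reads “CACHE DATA” and refers to the reused cache node.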

[0360] Referring to FIG. 18, a process of file allocation 1800 is shown in flowchart form. The process 1800 begins in a start state 1805 and moves to a state 1810 where the client 110 sends a file allocation request that includes a filename for a new file and a file handle for the new file's parent directory.

[0361] The process 1800 moves to the state 1815, and the server node 150 indicated in the parent directory's file handle receives the file allocation request. For the purposes of the description of this figure, this server node 150 will be known as the “parent” server.

[0362] The process 1800 moves to the state 1820, and the “parent” server 150 uses workload statistics received from the other server nodes of the DFSS 100 to decide if the file will be “owned” by the “parent” server node 150 or by another server node.

[0363] If the “parent” server node 150 decides that it will be the owner of the new file, then the process 1800 moves to a state 1830, where the “parent” server creates a new file, makes an appropriate new Filename Entry 410 in the Filename Table 310, and allocates a new G-node 600 for the new file. At this point, the “parent” server node 150 has enough information to create the file handle 1300 for the new file.

[0364] Returning to the state 1820, if the “parent” server node 150 decides that another server node should own the new file, the process 1800 moves to a state 1850, where the “parent” server 150 sends a file allocation request to another server of the DFSS 100. For the purposes of describing this figure, the other server will be known as the “second” server.

[0365] From the state 1850, the process 1800 moves to a state 1855 where the “second” server creates a new file, makes the appropriate new Filename Entry 410 in the Filename Table 310, and allocates the new G-node 600 for the new file. At this point, the “second” server has enough information to create the file handle 1300 for the new file.

[0366] From the state 1855, the process 1800 moves on to a state 1860, where the “second” server sends the file handle 1300 for the new file to the “parent” server node 150.

[0367] At this point, when the “parent” server node 150 has the file handle 1300 for the new file, the process 1800 moves on to a state 1835.

[0368] The state 1835 can also be reached from state 1830 in the case where the “parent” server 150 decided to be the owner of the file. As disclosed above, in state 1830 the “parent” server 150 also had the information to create a file handle 1300 for the new file, and the process 1800 also moves on to a state 1835.

[0369] For either case, in state 1835, the “parent” server node 150, as owner of the new file's parent directory, allocates a Gnid 710 for the new file, adds it to the appropriate Gnid-string 700, and, if one does not already exist, the “parent” server node 150 makes an appropriate new Filename Entry 410 in the Filename Table 310.

[0370] From state 1835, the process 1800 moves on to a state 1840, where the “parent” server node 150 sends the file handle 1300 for the new file to the requesting client 110.

[0371] The process 1800 moves on to a state 1845 where the process of file allocation is now complete. The requesting client 110 can access the new file using the newly received file handle 1300, and since the file handle 1300 contains identification for the server that owns the new file, any access request can be automatically routed to the appropriate server node.
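
The allocation flow of FIG. 18 can be sketched, for illustration only, in Python. The sketch rests on assumptions that the description above leaves open: the workload statistic is reduced to a single hypothetical `load` number, the ownership decision of state 1820 is taken to be "pick the least loaded server," and the request and reply exchanged between the “parent” and “second” servers in states 1850 through 1860 are collapsed into a direct method call. The `Server` class and its attributes are likewise hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

FileHandle = Tuple[int, int]  # simplified: (ServerID, G-node index)


@dataclass
class Server:
    server_id: int
    load: float  # hypothetical workload statistic
    gnodes: List[str] = field(default_factory=list)
    # Parent directory G-node index -> list of (filename, child file handle).
    gnid_strings: Dict[int, List[Tuple[str, FileHandle]]] = field(
        default_factory=dict)

    def create_file(self, name: str) -> FileHandle:
        """States 1830 / 1855: allocate a G-node and derive the file handle."""
        self.gnodes.append(name)
        return (self.server_id, len(self.gnodes) - 1)


def allocate(parent: Server, parent_gnode: int, name: str,
             cluster: List[Server]) -> FileHandle:
    owner = min(cluster, key=lambda s: s.load)  # state 1820 (assumed policy)
    handle = owner.create_file(name)            # state 1830, or 1850-1860
    # State 1835: the parent directory's owner records the new child.
    parent.gnid_strings.setdefault(parent_gnode, []).append((name, handle))
    return handle                               # state 1840
```

In this sketch the new file's handle may name a server other than the one holding the parent directory, which is why the parent's Gnid entry stores the full handle rather than a local index.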

[0372] Redirectors

[0373] In various embodiments, the DFSS 100 can be configured to store and manage a very large number of files of widely varying sizes. In some embodiments, it can be advantageous to store all of the file metadata on disk, while copies of the metadata for only some of the most recently used files are additionally cached in volatile memory. In some embodiments, memory for metadata structures can be dynamically allocated as new metadata structures are brought from disk to volatile memory.

[0374] FIG. 19 depicts one embodiment of a scheme to allow for efficient access to file metadata when not all metadata is kept in volatile memory. In the embodiment shown in FIG. 19, a G-node Redirector (GNR) array 1900 in volatile memory holds a G-node Redirector (GNR) 1910 per file. The G-node Redirector (GNR) is a small data structure that comprises information for locating the G-node 600 of a desired file, including information regarding whether the file's G-node 600 is currently in cache 1920. In the embodiment shown in FIG. 19, a client 110 requesting access to a given file sends a file handle 1300 that includes an index for the desired G-node Redirector (GNR) 1910 in the G-node Redirector (GNR) array 1900, which references the G-node 600 of the desired file. In one embodiment, when a desired G-node 600 is not currently cached, a least recently used G-node 600 in cache 1920 can be removed from cache 1920, and a copy of the desired G-node 600 can be brought from the disk array to the cache 1920.
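
The redirector scheme of FIG. 19 can be sketched, for illustration only, in Python. The `GnodeStore` class and its attributes are hypothetical. Each redirector is reduced to a single flag recording whether the G-node is cached, the cache is modeled with `collections.OrderedDict` to track least recently used order, and writing a modified G-node back to disk before eviction is omitted.

```python
from collections import OrderedDict
from typing import Dict


class GnodeStore:
    """Hypothetical G-node cache fronted by a per-file redirector array."""

    def __init__(self, disk: Dict[int, str], capacity: int) -> None:
        self.disk = disk          # GNR index -> G-node as stored on disk
        self.capacity = capacity  # number of G-nodes the cache can hold
        self.cache: "OrderedDict[int, str]" = OrderedDict()  # LRU first
        # Redirector array: one entry per file, True if G-node is cached.
        self.gnr: Dict[int, bool] = {index: False for index in disk}

    def get(self, gnr_index: int) -> str:
        if self.gnr[gnr_index]:
            # Cached: refresh its position as most recently used.
            self.cache.move_to_end(gnr_index)
            return self.cache[gnr_index]
        if len(self.cache) >= self.capacity:
            # Remove the least recently used G-node from cache.
            evicted, _ = self.cache.popitem(last=False)
            self.gnr[evicted] = False
        # Bring a copy of the desired G-node from disk into cache.
        self.cache[gnr_index] = self.disk[gnr_index]
        self.gnr[gnr_index] = True
        return self.cache[gnr_index]
```

For example, with a capacity of two, accessing redirector indices 0, 1, 0, and then 2 evicts the G-node for index 1, since index 0 was used more recently.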

[0375] Super G-nodes

[0376] In one embodiment, the file system 250 can be advantageously configured to store file metadata in a data structure called a Super G-node (SG) that comprises the file's G-node, other file information, and information that allows the file system 250 to locate the physical storage locations of the file's data blocks, as will be described in greater detail below.

[0377] FIG. 20A shows one embodiment of a Super G-node 2000 structure of fixed size that can provide location information for files of a wide variety of sizes. As shown in FIG. 20A, a Status field 2010 in the Super G-node 2000 can be used to indicate a type of Super G-node that corresponds to a category of associated file sizes, as will be described in greater detail with reference to FIG. 20B. A Linking Information field 2020 can be used to interconnect Super G-nodes 2000 into one or more linked lists or other structures. A G-node field 2030 comprises attribute and other information about a corresponding file that is similar to the information stored in the G-node 600 embodiment described with reference to FIG. 6. A File Location Data field 2040 in the Super G-node 2000 allows the file system 250 to locate a file's data, as will be described in greater detail below.

[0378] In the embodiment shown in FIG. 20A, the Super G-node 2000 comprises 16 Kbytes of memory. The Status 2010, Linking Information 2020, and G-node 2030 fields together comprise 128 Bytes of the Super G-node 2000, and the remainder of the Super G-node can be used to store the File Location Data 2040.

[0379] FIG. 20B depicts one embodiment of a scheme that uses Super G-nodes 2000 of a fixed size to hold information about files of widely differing sizes. In the embodiment shown in FIG. 20B, four types 2001-2004 of Super G-node 2000 are depicted.

[0380] A Super G-node 2000 of type Super G-node Data (SGD) 2001 can be used for a file that is small enough that its data 2005 can fit entirely within the File Location Data 2040 field of the SGD 2001. For the embodiment described with reference to FIG. 20A, a small file refers to a file that is 16,256 Bytes, or smaller. When a file's Super G-node 2000 is of type SGD 2001, locating the file's data simply means reading it from the File Location Data 2040 field of the SGD 2001.

[0381] In the embodiment shown in FIG. 20B, a Super G-node 2000 of type Super G-node Gee (SGG) 2002 can be used for medium files, that is, files of sizes up to approximately 700 MegaBytes of data that are too large to fit into an SGD 2001. In an SGG 2002, the File Location Data 2040 field is used to hold a Gee String Packet (GSP) 2007 that comprises information very similar to that of the Gee-String 500 described with reference to FIG. 5. As with the Gee-String 500, the Gee String Packet 2007 comprises Gees 2006 that point to the physical locations of the file's data 2005.

[0382] A Super G-node 2000 of type Super G-node List (SGL) 2003 can be used for large files whose Gee-String 500 is too large to be described by a Gee String Packet 2007 that fits within the SGL's 2003 File Location Data 2040 field. Instead, the SGL's 2003 File Location Data 2040 field is used to hold a Gee String Packet Block (GSPB) 2008, which is a list of pointers to a plurality of Gee String Packets 2007 that together describe the Gees 2006 that point to the locations of the file's data 2005. In one embodiment, an SGL 2003 can reference files of sizes up to approximately 490 GigaBytes.

[0383] A Super G-node 2000 of type Super G-node List of Lists (SGLL) 2004 can be used for very large files. Here, the File Location Data 2040 field of the SGLL 2004 comprises a Gee String Packet List Block 2009 that comprises pointers to a plurality of Gee String Packet Blocks 2008 that point to a plurality of Gee String Packets 2007 that point to a plurality of Gees 2006 that point to a plurality of storage locations that hold the desired data 2005.

[0384] In one embodiment, Gee String Packet List Blocks 2009, Gee String Packet Blocks 2008, and Gee String Packets 2007 are implemented in structures that are equivalent in size and organization to the Super G-node 2000 described with reference to FIG. 20A, except that the G-node field 2030 is not used.
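
The size categories described for FIGS. 20A and 20B can be worked through in a short Python sketch, offered for illustration only. The 16 Kbyte structure size, the 128 Byte header, and the approximate 700 MegaByte and 490 GigaByte limits come from the description above. The sketch assumes that 16 Kbytes means 16 × 1024 Bytes, which is consistent with the stated 16,256 Byte limit for small files, and it assumes decimal units and inclusive boundaries for the two approximate limits, neither of which is specified. The constant and function names are hypothetical.

```python
SUPER_GNODE_BYTES = 16 * 1024  # 16 Kbytes per Super G-node
HEADER_BYTES = 128             # Status + Linking Information + G-node fields
FILE_LOCATION_BYTES = SUPER_GNODE_BYTES - HEADER_BYTES  # 16,256 Bytes

# Approximate limits from the text; decimal units are an assumption.
SGG_LIMIT_BYTES = 700 * 10**6  # about 700 MegaBytes
SGL_LIMIT_BYTES = 490 * 10**9  # about 490 GigaBytes


def super_gnode_type(file_size: int) -> str:
    """Pick the Super G-node type for a file of the given size in Bytes."""
    if file_size <= FILE_LOCATION_BYTES:
        return "SGD"   # data stored inside the File Location Data field
    if file_size <= SGG_LIMIT_BYTES:
        return "SGG"   # one Gee String Packet
    if file_size <= SGL_LIMIT_BYTES:
        return "SGL"   # a Gee String Packet Block of packet pointers
    return "SGLL"      # a Gee String Packet List Block of block pointers
```

The subtraction 16,384 − 128 = 16,256 reproduces the small-file limit given for the SGD type, and each larger type adds one level of indirection between the Super G-node and the Gees.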

[0385] Parity Groups

[0386] The foregoing description of a distributed file storage system addresses the need for a fault tolerant storage system with improved reliability and scalability characteristics. This system features a flexible disk array architecture that accommodates the integration of variably sized disk drives into the disk array and provides mechanisms to permit each drive's capacity to be more fully utilized than prior art systems. In one embodiment, variably sized data and parity blocks are distributed across the available space of the disk array. Furthermore, the system provides methods of redistributing data across the disk array to improve data storage and retrieval, as well as to provide improved fault-tolerance. Another benefit of the data redistribution characteristics of the system is that it continues to provide fault-tolerant data access in situations where many drives of the disk array have failed. This feature is a notable improvement over conventional RAID systems that typically only provide fault-tolerance for single (or at most two) drive failures.

[0387] FIG. 22A shows a file storage system 100 having the server node 150 that operates within a computer network architecture to provide data and file storage. The computer network comprises one or more clients 110 that exchange information with the server node 150 through the communications medium or fabric 120 to store and retrieve desired data from the server node 150. In one aspect, the clients 110 include one or more computing devices that exchange information with the server node 150 through the communications medium 120.

[0388] The communications medium 120 can be any of a number of different networking architectures including, for example, Local Area Networks (LAN), Wide Area Networks (WAN), and wireless networks, which may operate using Ethernet, Fibre Channel, Asynchronous Transfer Mode (ATM), Token Ring, etc. Furthermore, any of a number of different protocols can be used within the communications medium 120 to provide networking connectivity and information exchange capabilities between the clients 110 and the server node 150, including, for example, TCP/IP protocols, Bluetooth protocols, wireless local area networking protocols (WLAN), or other suitable communications protocols.

[0389] The server node 150 includes the server 130 that serves as a front end to the disk array 140. The server 130 receives information and requests from the clients 110 and processes these requests to store and retrieve information from the disk array 140. In one aspect, the server 130 maintains at least a portion of an instruction set or file system that determines how data and information are stored and retrieved from the disk array 140.

[0390] Although the server node 150 is illustrated as a single entity in FIG. 22A, it will be appreciated that many server nodes 150 can be connected to the communications medium 120. Thus, a plurality of server nodes 150 can be connected to the communications medium 120 and accessible to the clients 110 for the purposes of information storage and retrieval. Furthermore, the server nodes 150 can operate independently of one another or be configured to transparently present a single disk image to each client 110, thus creating a unified storage area that facilitates end user interaction with the server nodes 150. In one aspect, the server nodes 150 incorporate functionality for maintaining the single disk image through the use of the file system present in each of the servers 130, which provides the communication and organization needed to create the single disk image.

[0391] FIG. 22B illustrates another embodiment of a file storage system comprising a distributed file storage system architecture. In this embodiment, two or more server nodes 150, 151 are physically or logically interconnected to form the cluster 160. File data stored on any server node is accessible to any other server in the cluster 160. The cluster 160 may also provide metadata and transaction mirroring. Furthermore, stored files may be replicated across at least two server nodes 150, 151 within the distributed file storage system 100 to provide increased redundancy or data mirroring capabilities.

[0392] One advantage achieved by the aforementioned distributed configurations is that they may provide increased data protection and/or fault tolerance. For example, if the replicated server node 150 fails or becomes unavailable, the second replicated server node 151 can handle client requests without service interruption. Another advantage achieved by using this interconnected arrangement is that alternative server node access paths 165 can be created where identical data can be read simultaneously from the two or more interconnected server nodes 150, 151. Thus, if one server node 150 in the cluster is busy and unavailable, another redundant server node 151 can service client requests to increase data throughput and accessibility. As with the single server node configuration, a plurality of clusters 160 may be present and accessible to the clients 110. Similarly, the clusters 160 can be configured to present a single disk image to the clients 110 to facilitate interaction by the end users of the distributed file storage system 100.

[0393] As shown in FIG. 22B, each disk array 140, 141 in the server nodes 150, 151 can include a variable number of disks where each server node 150, 151 has a different disk array configuration. Each disk within the disk array 140, 141 can have a different storage capacity. These features of the distributed file storage system 100 contribute to improved flexibility and scalability in configuring the server nodes 150, 151.

[0394] The variable disk configuration of the distributed file storage system 100 overcomes a limitation present in many conventional storage systems, which require that upgrades to the storage system be performed in a coordinated manner where all disks in each disk array 140, 141 are replaced in unison. Additionally, many conventional storage systems, including RAID architectures, require strict conformity amongst the disk arrays within the system, as well as conformity in disk capacity within individual disk arrays. The distributed file storage system 100 of the present invention is not limited by the restriction of uniform disk upgrades or conformity in disk capacity and can accommodate replacement or upgrades of one or more drives within each server node with drives of differing capacity. To maintain data integrity and knowledge of available storage space within the distributed file storage system 100, one of the functions of the aforementioned file system present in the servers 130, 131 is to accommodate differences in disk array capacity and disk number between the server nodes.

[0395] FIG. 23 illustrates the use of a distributed file storage mechanism within the disk array 140 to improve space utilization and flexibility of data placement. A space mapping configuration 2300 is illustrated for the disk array 140 where each disk 2305 is subdivided into a plurality of logical blocks or clusters 2310. For the purposes of this illustration the cluster size is shown to be fixed across all disks 2305 of the array 140, although, as will be illustrated in greater detail in subsequent figures, the cluster size can be variable within each disk 2305 and across disks 2305 within the array 140.

[0396] A first file 2320 having data to be stored on the disk array 140 is subdivided into one or more data blocks. The determination of the data block size, number, and distribution is calculated by the file system as data storage requests are received from the clients 110. Each data block 2330 is mapped or assigned to a location within the disk array 140 that corresponds to the particular disk 2305 and logical block 2310 within the disk 2305. Unlike conventional disk arrays, the block size used for data storage is variable from one block to the next within the file.

[0397] The server 130 organizes and distributes information to the disk array 140 by dividing a file into one or more data blocks 2330 that are distributed between one or more parity groups 2335. Each parity group 2335 includes a discrete number of data blocks 2330 and further includes a parity block 2337 containing parity information calculated for the data blocks 2330 contained within the particular parity group 2335. Unlike conventional systems, the size of the data blocks 2330 and parity blocks 2337 is not singularly fixed throughout the disk array 140. The collection of data blocks 2330 and parity blocks 2337 can include a number of different sizes and configurations resulting in more flexible storage of data within the disk array 140.

[0398] Using File #1 in FIG. 23 as an example, the information contained in the file is distributed in 7 data blocks corresponding to DATA₁1-DATA₁7. The data blocks DATA₁1-DATA₁7 are distributed between 3 parity groups, wherein the first parity group contains DATA₁1-DATA₁2, the second parity group contains DATA₁3-DATA₁4, and the third parity group contains DATA₁5-DATA₁7. Furthermore, 3 parity blocks PARITY₁1-2, PARITY₁3-4, and PARITY₁5-7 are formed, one for each parity group.

[0399] The parity groups 2335 are determined by the server 130, which assesses the incoming data to be stored in the disk array 140 and determines how the data is distributed into discrete data blocks 2330 and furthermore how the data blocks 2330 are distributed into parity groups 2335. After determining the data block and parity group distribution, the server 130 calculates the parity information for the data blocks 2330 in each parity group 2335 and associates the parity block 2337 containing this information with the appropriate parity group 2335.

[0400] The server 130 then determines how the information for each parity group 2335 is stored within the disk array 140. Each data block 2330 and parity block 2337 is distributed within the disk array 140 in an arrangement where no blocks 2330, 2337 originating from the same parity group 2335 are stored on the same disk of the disk array 140. The non-overlapping storage of data blocks 2330 and parity blocks 2337 derived from the same parity group 2335 creates the fault-tolerant data storage arrangement where any block 2330, 2337 within a parity group 2335 can be reconstructed using the information contained in the other remaining blocks of the parity group 2335. This arrangement, where blocks 2330, 2337 associated with the same parity group 2335 are not stored on the same disk 2305, is important in case of a disk failure within the array 140 to insure that lost data can be reconstructed. Otherwise, if two or more blocks associated with the same parity group 2335 are stored on the same drive, then in the event of a disk failure, data recovery cannot be assured.
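The placement constraint described above can be illustrated with a minimal Python sketch. The function name `place_parity_group`, the `free_space` map, and the greedy "most free space first" policy are assumptions introduced here for illustration; the specification does not state how the server 130 chooses among eligible disks.

```python
from typing import Dict, List


def place_parity_group(block_sizes: List[int],
                       free_space: Dict[int, int]) -> Dict[int, int]:
    """Assign each block of one parity group to a distinct disk.

    block_sizes: sizes of the group's blocks (data blocks plus parity block).
    free_space:  hypothetical map of disk number -> free units remaining.
    Returns a map of block index -> disk number and updates free_space.
    """
    if len(block_sizes) > len(free_space):
        raise ValueError("parity group has more blocks than there are disks")
    placement: Dict[int, int] = {}
    used_disks = set()
    # Place the largest blocks first so they get first pick of roomy disks.
    for idx in sorted(range(len(block_sizes)), key=lambda i: -block_sizes[i]):
        candidates = [d for d, free in free_space.items()
                      if d not in used_disks and free >= block_sizes[idx]]
        if not candidates:
            raise ValueError("no eligible disk for block %d" % idx)
        # Assumed policy: prefer the disk with the most free space,
        # breaking ties in favor of the lower disk number.
        disk = max(candidates, key=lambda d: (free_space[d], -d))
        placement[idx] = disk
        used_disks.add(disk)
        free_space[disk] -= block_sizes[idx]
    return placement


free = {0: 10, 1: 4, 2: 8, 3: 2}
print(place_parity_group([2, 2, 2], free))
```

Because every block lands on a different disk, the loss of any single disk removes at most one block from the group, which is the condition the paragraph above relies on for reconstruction.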

[0401] An example distribution of data blocks 2330 and parity blocks 2337 within the disk array 140 is shown in FIG. 23. The 7 data blocks and 3 parity blocks corresponding to the File #1 are distributed along disk numbers 0, 1, 8, 3, 7, 2 and 2110 respectively. In a similar manner, a second file 2340 is divided into 4 data blocks (and 2 parity groups) that are distributed along disk numbers 0, 2, 4, and 5 respectively. The size, order, and placement of the data blocks is predetermined by the server 130, which assigns regions of each disk 2305, corresponding to particular logical blocks, to store data blocks of designated sizes. The parity blocks 2337 of the parity groups 2335 associated with the first file 2320 are further stored on disks 9, 6, 11 with the parity blocks 2337 of the second file 2340 stored on disks 3, 9.

[0402] The data blocks 2330 and the parity blocks 2337 need not be sequentially stored but rather can be distributed throughout the disk array 140. Using this arrangement, the distributed file storage system 100 permits the non-sequential assignment and storage of parity group information in a flexible manner that is not limited by a rigid order or placement schema. Flexible block placement in the aforementioned manner improves disk utilization within the disk array 140 and provides for accommodating variable disk capacities as will be shown in greater detail in subsequent figures.

[0403] FIG. 24A illustrates a process 2400 for the storage of data and parity information within the distributed file storage system 100. The process 2400 commences with a data storage request 2410 issued by the client 110 to the server node 150. During this time the client 110 sends or transmits data 2415 to the server node 150 which receives and prepares the data 2420 for subsequent processing and storage. In one embodiment, the server node 150 includes hardware and/or software functionality to perform operations such as error checking, data buffering, and re-transmission requests, as needed, to insure that the data 2415 is received by the server 130 in an uncorrupted manner. Furthermore, the server node 150 is able to process simultaneous requests from a plurality of clients 110 to improve performance and alleviate bandwidth limitations in storage and retrieval operations. In one aspect, the data 2415 is transmitted through the communications fabric 120 in the form of a plurality of data packets that are automatically processed by the server node 150 to generate the data 2415 that is to be desirably stored within the disk array 140.

[0404] Upon receiving the data 2420, the server 130 analyzes the characteristics of the data 2430 to determine how the data 2415 will be distributed into one or more data blocks 2330. In one aspect, the data analysis 2430 includes identifying the content or type of data that has been sent, such as, for example, multimedia data, textual data, or other data types. Using one or more of the plurality of available disk block sizes, the server 130 identifies desirable block sizes and distribution mappings that are used to group the data 2415 and organize it into the data blocks 2330.

[0405] The data 2415 is then parsed into blocks 2440 according to the data analysis 2430 and the resulting blocks are further arranged into one or more parity groups 2450. The parity group arrangement determination 2450 distributes the data blocks 2330 between the parity groups 2335 and dictates the size of the parity blocks 2337 that will be associated with each parity group 2335. For example, a parity group composed of 3 data blocks having sizes of 128K, 64K, and 256K respectively will have a different associated parity block size than a parity group composed of 2 data blocks having sizes of 128K and 256K. The server 130 can therefore vary the block size as well as the parity group size in a number of different ways to achieve improved storage and distribution characteristics within the disk array 140.

[0406] In one aspect, the distributed file storage system 100 is an improvement over conventional systems by allowing both data and parity blocks to be assigned to physical disk blocks. Furthermore, the mapping of the data and parity blocks to the physical disk(s) may be performed either before or after the parity calculations, thus improving storage flexibility.

[0407] Upon determining the parity group arrangement 2450, the server 130 calculates the parity blocks 2460 for each parity group 2335. As previously described, the parity block calculation 2460 creates a fault-tolerant information block which is associated with each group of data blocks 2330 within the parity group 2335. The parity block is calculated 2460 by selecting all data blocks 2330 in a parity group 2335 and performing a logical operation on the data 2415 contained therein to compute error correction information. In one embodiment, the error-correction information is determined using the exclusive OR logical operation to generate the parity information. Using this error-correcting information, the parity block 2337 can be used to restore the information contained in a particular data block 2330 or parity group 2335 that may become corrupted. Furthermore, the parity information can be used to restore the contents of entire disks 2305 within the disk array using the error correction information in conjunction with other non-corrupted data.
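The exclusive OR embodiment can be illustrated with a short Python sketch. The function names `xor_parity` and `reconstruct` are hypothetical, and zero-padding shorter blocks up to the length of the longest block is an assumption made here so that blocks of different sizes can be combined; the specification does not state how unequal block lengths are handled.

```python
from functools import reduce
from typing import List


def xor_parity(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of all blocks, zero-padding shorter blocks (assumption)."""
    size = max(len(b) for b in blocks)
    padded = [b.ljust(size, b"\x00") for b in blocks]
    # zip(*padded) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*padded))


def reconstruct(surviving: List[bytes], parity: bytes, lost_len: int) -> bytes:
    """Rebuild one lost block from the surviving blocks and the parity block."""
    return xor_parity(surviving + [parity])[:lost_len]


data = [b"abcd", b"xy", b"12345678"]
parity = xor_parity(data)
# Pretend the disk holding data[1] failed and rebuild it from the rest.
assert reconstruct([data[0], data[2]], parity, len(data[1])) == b"xy"
```

XOR is its own inverse, so combining the parity block with every surviving block cancels their contributions and leaves the missing block. This works only when a single block of the group is lost, which is why blocks of one parity group are kept on separate disks.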

[0408] When the parity groups 2335 have been formed, the server 130 then determines how the data blocks 2330 and parity block 2337 for each parity group 2335 will be distributed 2470 in the disk array. Although the data 2415 can be striped sequentially across the disks 2305 of the disk array 140, it is typically more efficient to map and distribute the blocks 2330, 2337 throughout the disk array 140 in a non-sequential manner (see FIG. 23). Mapping the data blocks 2330 in this manner requires knowledge of how the data blocks 2330 are positioned and ordered within the disk array 140. Detailed knowledge of the mapping for each data block 2330 is maintained by the server 130 using a file storage mapping structure. This structure will be discussed below in connection with FIGS. 7 and 9. Using the mapping schema determined by the server 130, the blocks 2330, 2337 of each parity group 2335 are stored 2480 in the disk array 140.

[0409] As previously indicated, the distributed file storage system 100 employs a variable parity approach where the size of the parity block 2337 is not necessarily constant. The server 130 creates parity blocks 2337 by selecting one or more data blocks 2330 for which error correction information will be computed. The size of the parity block 2337 is dependent upon the number of data blocks 2330 whose error correction information is computed and is determined by the server 130. In one aspect, the server 130 selects a parity block size that is convenient and efficient to store within the existing space of the disk array 140. The server 130 also provides for distributed placement of the parity blocks 2337 in a manner similar to that of the data blocks 2330. Thus, both data blocks 2330 and parity blocks 2337 are desirably mapped throughout the disk array 140 with the server 130 maintaining a record of the mapping.

[0410] The server 130 insures that both data blocks 2330 and parity blocks 2337 are appropriately positioned within the disk array 140 to insure some level of fault tolerance. Therefore, the server 130 desirably distributes selected data blocks and parity blocks containing error correction information for the selected data blocks on non-overlapping disks (e.g. all blocks of a parity group are on separate disks). This insures that if a disk failure does occur, the corrupted information can be recovered using the remaining data/parity information for each parity group. Upon calculating the appropriate parity information and distribution mapping 2470, the parity blocks 2337 are stored in the disk array 2480 in a manner designated by the server 130.

[0411] FIG. 24B illustrates another embodiment of a process 2405 for the storage of data and parity information within the distributed file storage system 100. As with the aforementioned data and parity information storage method 2400, the process begins with the data storage request 2410 issued by the client 110 to the server node 150. Subsequently, an analysis of the characteristics of the data 2430 is performed to determine how the data 2415 will be distributed into the one or more data blocks 2330. The data 2415 is then parsed into blocks 2440 according to the data analysis 2430 and the resulting blocks are further arranged into one or more parity groups 2450. The server 130 then determines how the data blocks 2330 and parity block 2337 for each parity group 2335 will be distributed 2470 in the disk array. At this point the client 110 sends or transmits data 2415 to the server node 150, which receives and prepares the data 2420 for subsequent processing and storage. After receiving the data 2420, the server 130 calculates the parity blocks 2460 for each parity group 2335. Once the data blocks 2330 and parity blocks 2337 have been obtained they are stored in the disk array 2480 in a manner similar to that described with reference to FIG. 24A above.

[0412] In either method of data and parity information storage 2400, 2405, the transfer of information from the client 110 may comprise both a parametric component and a data component. The parametric component defines a number of parameters used in the storage of information to the disk array 2480 and may include, for example: operation definitions, file handles, offsets, and data lengths. When using the aforementioned storage methods 2400, 2405 the parameters and data need not necessarily be transferred at the same time. For example, the parameters may be transferred during the client storage request 2410 and the data may be transferred anytime thereafter in a subsequent stage of the method 2400, 2405. In one aspect, transfer of information using the parametric and data components desirably allows the distributed file storage system 100 to make decisions about how to process the incoming data prior to the actual data transfer to thereby improve the flexibility and functionality of the system.

[0413] FIG. 25 illustrates another embodiment of the distributed file storage system 100 using a variable capacity disk array. The variable capacity disk array incorporates a plurality of disks 2305 with potentially non-identical sizes whose space can be addressed and used for storing data blocks 2330 and parity blocks 2337. Unlike conventional RAID storage systems that are limited by the capacity of the smallest drive within the disk array, the variable capacity disk array can contain any number or combination of disks and is not limited to accessing an address space boundary 2490 denoted by the smallest drive in the array. Using similar methods as described previously in conjunction with FIGS. 23 and 24, the server 130 receives files 2320, 2340 and determines a parity group distribution for each file such that a plurality of data blocks 2330 and parity blocks 2337 are created. The data blocks 2330 and parity blocks 2337 are then distributed throughout the disk array 140 in such a manner so as to avoid storing more than one block 2330, 2337 from the same parity group 2335 on a single disk 2305. The server 130 stores these blocks 2330, 2337 across all of the available disk space, and thus is able to access disk space that lies beyond the boundary 2490 defined by the smallest disk capacity (a typical storage boundary which limits conventional systems). As shown in FIG. 25, the distributed file storage system 100 stores both data blocks 2330 and parity blocks 2337 throughout the address space of each disk 2305 without boundary limitations imposed by other disks within the array 140.
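The effect of the boundary 2490 can be illustrated with a small arithmetic sketch in Python. The function names and the disk capacities are hypothetical, and the figures are raw addressable space before any parity overhead is subtracted.

```python
from typing import List


def addressable_with_boundary(capacities: List[int]) -> int:
    """Raw space when every disk is limited to the smallest disk's capacity."""
    return min(capacities) * len(capacities)


def addressable_without_boundary(capacities: List[int]) -> int:
    """Raw space when each disk's full address space can be used."""
    return sum(capacities)


# Hypothetical array of four disks, capacities in gigabytes.
disks = [40, 80, 80, 120]
print(addressable_with_boundary(disks))     # 160
print(addressable_without_boundary(disks))  # 320
```

In this hypothetical array, half of the raw capacity lies beyond the boundary set by the 40 GB disk and would go unused under the smallest-disk limitation.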

[0414] In addition to improved space utilization, a number of other important features arise from the aforementioned flexible distribution of the blocks 2330, 2337. In one aspect, using variable capacity disks 2305 within the array 140 contributes to improved scalability and upgradeability of the distributed file storage system 100. For example, if the unused storage space within the array 140 falls below a desired level, one or more of the disks within the array 140 can be readily replaced by higher capacity disks. The distributed file storage system 100 implements an on-the-fly or “hot-swap” capability in which existing disks within the array 140 can be easily removed and replaced by other disks. Since each server in a cluster maintains a copy of the metadata for other servers in the cluster, servers can also be hot-swapped. Using this feature, a new higher capacity disk can be inserted into the array 140 in place of a lower capacity disk. The server 130 is designed to automatically incorporate the disk space of the newly inserted drive and can further restore data to the new drive that resided on the former smaller capacity drive. This feature of the distributed file storage system 100 provides for seamless integration of new disks into the array 140 and facilitates disk maintenance and upgrade requirements.

[0415] In addition to exchanging or swapping existing disks 2305 within the array 140, the server 130 can accommodate the addition of new disks directly into the array 140. For example, the disk array 140 containing the fixed number of disks 2305 can be upgraded to include one or more additional disks such that the total number of disks in the array is increased. The server 130 recognizes the additional disks and incorporates these disks into the addressable space of the distributed file storage system 100 to provide another way for upgrading each disk array 140.

[0416] In the examples shown above, both the swapping of disks to increase storage space and the incorporation of additional disks into the array are facilitated by the flexible block placement and addressing of disk space within the array 140. Unlike conventional systems that have a rigid architecture where the number of disks within each array is fixed and the addressable disk space is dictated by the smallest disk within the array, the distributed file storage system 100 accommodates many different disk array configurations. This flexibility is due, in part, to the manner in which the disk space is formatted, as well as how the data is arranged and processed by the server 130.

[0417] In one aspect, the flexibility of the distributed file storage system 100 is improved through the use of parity groups 2335. In order to accommodate files with different characteristics, as well as to improve how information is distributed throughout the disk array 140, parity groups 2335 are formed with variable block numbers. The block number of the parity group is defined by the number of blocks 2330, 2337 within the group. For example, a parity group containing 4 data blocks is characterized as having a block number of 4. In a similar manner, a parity group containing a single data block is characterized as having a block number of 1. The block number of the parity group is one factor that determines the size of the parity group and additionally determines the information that will be used to form the parity block.

[0418] FIG. 26A illustrates the formation of variable block number parity groups in the distributed file storage system 100. In the illustrated embodiment, exemplary parity groups 2502, 2504 are shown with different extents having 4 and 2 data blocks respectively. The server 130 determines the number of data blocks 2330 associated with each group 2502, 2504 and furthermore determines the distribution of each type of parity group having specific block numbers that make up the total parity group distribution in the disk array 140. This feature of the distributed file storage system 100 is discussed in connection with FIGS. 29 and 34.

[0419] Data organization and management by the server 130 is maintained using one or more data structures that contain information which identifies the size and ordering of the data blocks 2330 within each parity group 2502, 2504. In one embodiment, the ordering or sequence of the blocks 2330, 2337 is maintained through a linked list organizational schema. The linked list contains one or more pointers that act as links 2505 between each block 2330, 2337 within the parity group 2335. The links 2505 therefore allow the server 130 to maintain knowledge of the order of the blocks 2330, 2337 as they are distributed throughout the disk array 140. As blocks are written to or read from the disk array 140, the server 130 uses the links 2505 to identify the order of the blocks 2330, 2337 used for each parity group 2335.
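The linked list schema can be sketched in Python as follows. The `BlockRecord` type, its field names, and the use of a (disk, address) pair as a link are assumptions made for illustration; the specification states only that pointers act as links 2505 between the blocks of a parity group.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Location = Tuple[int, int]  # hypothetical (disk number, block address) pair


@dataclass
class BlockRecord:
    kind: str                            # "DATA" or "PARITY"
    next_loc: Optional[Location] = None  # link to the next block in the group


def ordered_blocks(records: Dict[Location, BlockRecord],
                   head: Location) -> List[Location]:
    """Follow the links from the head block to recover the group's order."""
    order: List[Location] = []
    seen = set()
    loc: Optional[Location] = head
    while loc is not None:
        if loc in seen:
            raise ValueError("link cycle detected at %r" % (loc,))
        seen.add(loc)
        order.append(loc)
        loc = records[loc].next_loc
    return order


# A group of two data blocks and one parity block scattered over three disks.
records = {
    (7, 120): BlockRecord("DATA", next_loc=(2, 16)),
    (2, 16): BlockRecord("DATA", next_loc=(9, 301)),
    (9, 301): BlockRecord("PARITY"),
}
print(ordered_blocks(records, head=(7, 120)))
# [(7, 120), (2, 16), (9, 301)]
```

The physical locations carry no ordering information of their own, so the logical sequence of a group is recoverable only by walking the links, which is what allows the blocks to be placed non-sequentially.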

[0420] As shown in FIG. 26B, the distributed file storage system 100 can also allocate parity groups 2335 on the basis of block size. In the illustrated embodiment, exemplary parity groups 2506, 2508 are shown having the same block number of 4 with differing block sizes of 256K and 128K respectively. The feature of variable block size allocation within each parity group 2335 provides yet another way by which the server 130 can distribute data and information within the disk array 140 in a highly flexible and adaptable manner.

[0421] The implementation of parity groups having a plurality of different block numbers, as well as allowing for the use of different block sizes within each parity group, improves the ability of the server 130 to utilize available disk space within the array 140. Furthermore, using combinations of different data block and parity group characteristics allows the server to select combinations that are best suited for particular data types.

[0422] For example, large data files such as multimedia video or sound are well suited for storage using large parity groups that contain large block sizes. On the other hand, smaller files such as short text files do not have the same space requirements as the larger file types and thus do not significantly benefit from storage in a similar block size. In fact, when small files are stored in large blocks, there is the potential for wasted space, as the smaller file does not use all of the space allocated to the block. Therefore, the distributed file storage system 100 benefits from the ability to create data blocks 2330 and parity groups 2335 of variable sizes to accommodate different data types and permit their storage in a space-efficient manner.
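The wasted space described above can be quantified with a short Python sketch. The function name `wasted_space` and the file and block sizes are hypothetical values chosen for illustration.

```python
def wasted_space(file_size: int, block_size: int) -> int:
    """Bytes allocated but unused when a file is stored in fixed-size blocks."""
    blocks_needed = max(1, -(-file_size // block_size))  # ceiling division
    return blocks_needed * block_size - file_size


KB = 1024
small_file = 3 * KB
print(wasted_space(small_file, 256 * KB) // KB)  # 253 (KB unused in a 256K block)
print(wasted_space(small_file, 4 * KB) // KB)    # 1 (KB unused in a 4K block)
```

For this hypothetical 3K file, matching the block size to the data reduces the unused allocation from 253K to 1K, which is the benefit the variable block sizes are intended to capture.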

[0423] As discussed in connection with FIGS. 14, the distributed file storage system 100 further improves the utilization of space within the disk array 140 by implementing a mechanism for reorganizing the allocation of data blocks as needed to accommodate data stored to the disk array 140. Furthermore, a redistribution function (shown in FIG. 36) can alter the composition or distribution of blocks 2330, 2337 or parity groups 2335 within the array 140 to make better use of available space and improve performance by reorganizing information previously written to the array 140.

[0424] In order to maintain coherence in the data stored to the disk array 140, knowledge of the size and ordering of each block within the parity group 2335 is maintained by the server 130. Prior to writing of data to the disk array 140, the server 130 creates a disk map that allocates all of the available space in the disk array 140 for storing particular block sizes and/or parity group arrangements. Space allocation information is maintained by the server 130 in a metadata structure known as a Gee Table. The Gee Table contains information used to identify the mapping and distribution of blocks within the disk array 140 and is updated as data is stored to the disks 2305.

[0425] The Gee Table stores informational groups which interrelate and reference disk blocks or other discrete space allocation components of the disk array 140. These informational groups, referred to as Gee-strings, contain disk space allocation information and uniquely define the location of files in the disk array 140. Each Gee-string is subdivided into one or more Gee-groups, each of which is further divided into one or more Gees containing the physical disk space allocation information. The Gee-strings and components thereof are interpreted by the server 130 to define the mapping of parity groups 2335 in the disk array 140 which store information and files, as will be discussed in greater detail hereinbelow.

[0426] Based on the available space within the disk array 140, the server 130 determines the type and number of parity groups 2335 that will be allocated in the array 140. The initial parity group allocation prior to data storage forms the Gee Table and directs the storage of data based on available parity groups. The Gee Table therefore serves as a map of the disk space and is updated as data is stored within the blocks 2330, 2337 of the array 140 to provide a way for determining the file allocation characteristics of the array 140. The server 130 retrieves stored files from the disk array 140 using the Gee Table as an index that directs the server 130 to the blocks 2330 where the data is stored so that they may be retrieved in a rapid and efficient manner.

[0427] FIG. 27 illustrates a portion of a Gee Table used to determine the mapping of parity groups 2335 in the disk array 140. For additional details of this architecture the reader is directed to sections which relate specifically to the implementation of the file system.

[0428] In one embodiment, space allocation in the disk array 140 is achieved using a Gee Table 2530 containing an index field 2532, a G-code field 2534, and a data field 2536. The index field 2532 is a value that is associated with a row of information or Gee 2538 within the Gee Table 2530 and is used as an index or a pointer into the array or list comprising the Gee Table 2530. Additionally, the index field 2532 uniquely identifies each Gee 2538 within the Gee Table 2530 so that it can be referenced and accessed as needed.

[0429] The G-Code field 2534 indicates the type of data that is stored in the disk space associated with each Gee 2538 and is further used to identify space allocation characteristics of the Gees 2538. During initialization of the disk array, the server 130 assigns all of the disk space within the array 140 to various parity groups 2335. These parity groups 2335 are defined by the block size for data and parity blocks 2330, 2337 and the number of data blocks within the group 2335. Identifiers in the G-Code field 2534 correspond to flags including “FREE”, “AVAIL”, “SPARE”, “G-NODE”, “DATA”, “PARITY”, “LINK”, “CACHE-DATA”, or “CACHE-PARITY”.

[0430] The data field 2536 stores data and information interpreted by the server 130 in a specific manner depending upon the G-code field identifier 2534. For example, this field can contain numerical values representing one or more physical disk addresses defining the location of particular blocks 2330, 2337 of the parity groups 2335. Additionally, the data field 2536 may contain other information that defines the structure, characteristics, or order of the parity groups 2335. As will be described in greater detail hereinbelow, the information contained in the G-table 2530 is accessed by the server 130 and used to store and retrieve information from the disk array 140.

[0431] In one embodiment, the fields 2532, 2534, 2536 of the G-table 2530 map out how space will be utilized throughout the entire disk array 140 by associating each physical block address with the designated Gee 2538. Parity groups 2335 are defined by sets of contiguous Gees 2538 that are headed by the first Gee 2538 containing information that defines the characteristics of the parity group 2335. The G-Code field identifier “G-NODE” instructs the server 130 to interpret information in the data field 2536 of a particular Gee 2538 having the “G-NODE” identifier as defining the characteristics of a parity group 2335 that is defined by a G-group 2540.

[0432] A characteristic defined in the data field 2536 of the Gee 2538 having a “G-NODE” identifier includes an extent value 2542. The extent value 2542 represents the extent or size of the blocks 2330, 2337 associated with each Gee 2538 in a particular G-group 2540. The extent value 2542 further indicates the number of logical disk blocks associated with each file logical block 2330, 2337. For example, the Gee with an index of “45” contains the G-Code identifier “G-NODE” and has a value of “2” associated with the extent value. This extent value 2542 indicates to the server 130 that all subsequent data blocks and parity blocks defined in the parity group 2335 and represented by the G-group 2540 will have a size of 2 logical disk blocks. Thus, as indicated in FIG. 27, the Gees having indexes “46”-“49” are each associated with two logical addresses for drive blocks within the array 140. In a similar manner, the Gee 2538 with an index of “76” contains the G-Code identifier “G-NODE” and has an extent value of “3”. This value indicates to the server 130 that the subsequent Gees “77”-“79” of the parity group are each associated with 3 physical drive block addresses.
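The table layout described above can be modeled with a short sketch. The following Python is illustrative only and is not taken from the specification: the names `Gee`, `GCode`, `build_group`, and `extent_for`, the dictionary-based table, and the disk block addresses (1000, 2000, and so on) are all hypothetical. Only the indexes 45-49 and 76-79 and the extent values 2 and 3 come from the discussion of FIG. 27.

```python
from dataclasses import dataclass
from enum import Enum


class GCode(Enum):
    """G-code identifiers listed in the text."""
    FREE = "FREE"
    AVAIL = "AVAIL"
    SPARE = "SPARE"
    G_NODE = "G-NODE"
    DATA = "DATA"
    PARITY = "PARITY"
    LINK = "LINK"
    CACHE_DATA = "CACHE-DATA"
    CACHE_PARITY = "CACHE-PARITY"


@dataclass
class Gee:
    index: int      # index field 2532
    g_code: GCode   # G-code field 2534
    data: dict      # data field 2536, interpreted according to g_code


def build_group(start, extent, n_data, first_block):
    """Build one G-group: a G-NODE Gee, n_data DATA Gees, then one PARITY Gee.

    Disk block addresses are made-up consecutive integers.
    """
    gees = {start: Gee(start, GCode.G_NODE, {"extent": extent})}
    block = first_block
    for offset in range(1, n_data + 2):
        code = GCode.DATA if offset <= n_data else GCode.PARITY
        addresses = list(range(block, block + extent))
        gees[start + offset] = Gee(start + offset, code, {"blocks": addresses})
        block += extent
    return gees


def extent_for(table, index):
    """Walk back through contiguous Gees to the heading G-NODE and return its extent."""
    i = index
    while i in table:
        if table[i].g_code is GCode.G_NODE:
            return table[i].data["extent"]
        i -= 1
    raise LookupError(f"no G-NODE Gee heads the group containing index {index}")


# Gees 45-49 (extent 2) and 76-79 (extent 3), as in the FIG. 27 discussion.
table = {**build_group(45, 2, 3, 1000), **build_group(76, 3, 2, 2000)}

for idx in (46, 49, 77, 79):
    gee = table[idx]
    print(idx, gee.g_code.value, gee.data["blocks"], "extent", extent_for(table, idx))
```

In this sketch, each DATA or PARITY Gee carries exactly as many disk block addresses as the extent recorded in the G-NODE Gee that heads its group.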

[0433] In the preceding discussion of FIG. 27, information is organized into a single G-table; however, it will be appreciated that there are a number of different ways of storing the information to improve system flexibility, including the use of multiple tables or data structures. The exact manner in which this information is stored is desirably designed to ensure that it may be efficiently accessed. For example, in one embodiment nodes of the Gee Table 2530 can be utilized as a common storage vehicle for multiple types of metadata, including file names, identifiers (GNIDS), Gees, etc.

[0434] As discussed in connection with FIG. 29, other G-code identifiers are used during the storage and retrieval of information from the disk array 140. For example, another G-code identifier, “DATA”, signifies that the data field 2536 of a particular Gee 2538 is associated with the physical address for one or more drive blocks that will store data. Likewise, the G-code identifier “PARITY” signifies that the data field 2536 of a particular Gee is associated with the physical address for one or more drive blocks that store parity information. The parity information stored in the drive blocks referenced by the “PARITY” Gee is calculated based upon the preceding “DATA” Gees as defined by the “G-NODE” Gee. Thus, as shown in FIG. 27, the Gee 2538 having an index of “79” will store the physical address of disk blocks that contain parity information for data blocks specified by Gees having indexes “77”-“78”.

[0435] FIG. 28 illustrates a process 2448 used by the server 130 to prepare the disk array 140 for data storage. Preparation of the disk array 140 commences with the server 130 identifying the characteristics 2550 of each disk 2305 within the array 140 to determine the quantity of space available. In one embodiment, the server 130 identifies physical characteristics for the drives 2305 within the array 140. These characteristics can include: total drive number, individual drive size, sectors per disk, as well as other drive characteristics useful in determining the available space of the disk array 140. To facilitate the configuration of the array 140, the server 130 can automatically detect and recognize the presence of each disk 2305 within the array 140 and can electronically probe each disk 2305 to determine the drive characteristics. Alternatively, the server 130 can be programmed with information describing the array composition and drive characteristics without automatically determining this information from the array 140.

[0436] Upon acquiring the necessary information describing the array composition, the server 130 determines a parity group allotment 2555 to be used in conjunction with the available disk space. The parity group allotment 2555 describes a pool of parity groups 2335 that are available for data storage within the array 140. The parity group allotment further describes a plurality of different block and/or parity group configurations, each of which is suited for storing particular data and file types (e.g., large files, small files, multimedia, text, etc.). During data storage, the server 130 selects from the available pool of parity groups 2335 to store data in a space-efficient manner that reduces wasted space and improves data access efficiency.

[0437] In one embodiment, the parity group allotment is determined automatically by the server 130 based on pre-programmed parity group distribution percentages in conjunction with available disk space within the array 140. Alternatively, the server 130 can be configured to use a specified parity group allotment 2555 that is provided to the server 130 directly. In another aspect, the parity groups can be allocated dynamically by the server based on file characteristics such as file size, access size, file type, etc.
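As a rough illustration of a percentage-based allotment, the sketch below divides a pool of disk logical blocks among several parity group configurations. The function name `plan_allotment`, the percentage table, and the 12,000-block array size are hypothetical; the specification does not give a formula, so the sketch simply assumes that each parity group consumes (data blocks + 1 parity block) × extent disk logical blocks.

```python
def plan_allotment(total_disk_blocks, percent_by_config):
    """Return how many parity groups of each configuration fit in their share.

    percent_by_config maps (data_blocks_per_group, extent) to an integer
    percentage of the available disk logical blocks (assumed policy input).
    """
    if sum(percent_by_config.values()) > 100:
        raise ValueError("distribution percentages exceed 100")
    plan = {}
    for (n_data, extent), percent in percent_by_config.items():
        # Each group holds n_data data blocks plus one parity block,
        # and every block spans `extent` disk logical blocks.
        blocks_per_group = (n_data + 1) * extent
        budget = total_disk_blocks * percent // 100
        plan[(n_data, extent)] = budget // blocks_per_group
    return plan


# Hypothetical distribution: half the space for long groups suited to large
# files, the rest for shorter groups suited to small files.
shares = {(4, 2): 50, (2, 1): 30, (1, 1): 20}
allotment = plan_allotment(12_000, shares)
for (n_data, extent), count in allotment.items():
    print(f"{count} groups with {n_data} data blocks, extent {extent}")
```

Integer percentages are used so that the arithmetic stays exact; a real allotment would also have to spread the blocks of each group across distinct drives, which this sketch does not attempt.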

[0438] Based on the allotment information and the disk space available in the array 140, the server 130 performs a mapping operation 2560 to determine how the parity groups 2335 of the allotment will be mapped to physical block addresses of drives 2305 within the array 140. The mapping operation 2560 comprises determining a desirable distribution of parity groups 2335 on the basis of their size and the available space and characteristics of the disk array 140. As the distribution of parity groups 2335 is determined by the server 130, the G-table 2530 is created and populated with Gees 2538 which associate each available parity group 2335 with the physical block addresses defining its location on one or more disks 2305 in the disk array 140. Initially, the G-table 2530 describes parity groups 2335 that contain free or available space; however, as data is stored to the disk 2575, the G-table is updated to reflect the contents of the physical disk blocks that are pointed to by the Gees 2538.

[0439] During operation of the distributed file storage system 100, the G-table 2530 is accessed by the server 130 to determine the logical addresses of files and information stored within the disk array 140. Furthermore, the server 130 continually updates the G-table 2530 as information is saved to the disk array 140 to maintain knowledge of the physical location of the information as defined by the logical block addresses. The dynamically updated characteristics of the G-Table 2530 data structure therefore define and maintain the mapping of data and information in the disk array 140.

[0440] In addition to the aforementioned a priori method of parity group allocation, other methods of disk preparation may also be utilized. For example, another method of disk preparation can use a set of free disk block maps to allow dynamic allocation of the parity groups. This method additionally provides mechanisms for dynamic extension of existing parity groups and includes logic to ensure that the disk does not become highly fragmented. In some instances, fragmentation of the disk is undesirable because it reduces the ability to use long parity groups when mapping and storing information to the disk.

[0441] FIG. 29 illustrates one embodiment of a file storage schema 2600 that uses the aforementioned parity group arrangements 2335 and G-table 2530 to store information contained in an exemplary file 2605. The file 2605 contains information coded by an electronic byte pattern that is received by the server 130 during client storage requests. In the storage schema 2600, the file 2605 is divided into one or more file logical blocks 2610 for storage. Each file logical block 2610 is stored in a cluster of one or more disk logical blocks 2615 in the disk array 140. As previously indicated, the distributed file storage system 100 retains many of the advantages of conventional storage systems, including the distribution of files across multiple disk drives and the use of parity blocks to enhance error checking and fault tolerance. However, unlike many conventional systems, the distributed file storage system 100 does not restrict file logical blocks to one uniform size. File logical blocks of data and parity logical blocks can be the size of any integer multiple of a disk logical block. This variability of file logical block size increases the flexibility of allocating disk space and thus improves the use of system resources.

[0442] Referring to FIG. 29, the file 2605 is divided into a plurality of file logical blocks 2610, each of which contains a portion of the information represented in the file 2605. The number, size, and distribution of the file logical blocks 2610 are determined by the server 130 by selecting available disk logical blocks 2615 designated in the G-table 2530. The information contained in each file logical block 2610 is stored within the disk logical blocks 2615 and mapped using the G-table 2530. In the distributed file storage system 100, the size of each file logical block 2610 is described by the extent value 2542, which is an integer multiple of disk logical blocks 2615. For example, the logical block designated “LB-1” comprises two disk logical blocks 2615 and has an extent value of 2. In a similar manner, the logical block designated “LB-7” comprises three disk logical blocks 2615 and has an extent value of 3.
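The relationship between the extent value and the size of a file logical block can be sketched as follows. This Python is a minimal illustration, not the system's implementation: the function name `split_into_file_logical_blocks`, the 512-byte disk logical block size, and the zero padding of the final block are assumptions made for the example.

```python
DISK_LOGICAL_BLOCK_BYTES = 512  # assumed disk logical block size


def split_into_file_logical_blocks(payload, extent,
                                   disk_block_bytes=DISK_LOGICAL_BLOCK_BYTES):
    """Split a byte string into file logical blocks of `extent` disk logical blocks.

    The last block is zero-padded to a full block (a choice made for this sketch).
    """
    if extent < 1:
        raise ValueError("extent must be a positive integer")
    block_bytes = extent * disk_block_bytes
    blocks = []
    for offset in range(0, len(payload), block_bytes):
        chunk = payload[offset:offset + block_bytes]
        blocks.append(chunk.ljust(block_bytes, b"\x00"))
    return blocks


payload = bytes(range(256)) * 10          # a 2560-byte example file
extent_2 = split_into_file_logical_blocks(payload, extent=2)
extent_3 = split_into_file_logical_blocks(payload, extent=3)
print(len(extent_2), "blocks of", len(extent_2[0]), "bytes")
print(len(extent_3), "blocks of", len(extent_3[0]), "bytes")
```

With an extent of 2, each file logical block spans two disk logical blocks, as for “LB-1”; with an extent of 3, each spans three, as for “LB-7”.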

[0443] The server 130 forms parity groups 2335 using one or more file logical blocks 2610 and the associated parity block 2337. For each file 2605, one or more parity groups 2335 are associated with one another and ordered through logical linkages 2617 (typically defined by pointers) used to determine the proper ordering of the parity groups 2335 to store and retrieve the information contained in the file 2605. As shown in the illustrated embodiment, the file 2605 is defined by a parity string 2620 containing four parity groups 2335. The four parity groups are further linked by three logical linkages 2617 to designate the ordering of the logical blocks “LB-1” through “LB-10” which make up the file 2605.

[0444] The G-table 2530 stores the information defining the G-string 2620 using a plurality of indexed rows defining Gees 2538. The Gees 2538 define the characteristics of the G-strings 2620 and further describe the logical location of the associated file 2605 in the disk array 140. In the G-table 2530, the G-string 2620 is made up of one or more Gee-groups. Each G-group is a set of contiguous Gees 2538 that all relate to a single file. For example, in the illustrated embodiment, the Gee-string 2620 includes three Gee-groups 2627, 2628, and 2629.

[0445] The first Gee in each G-group 2627-2629 is identified by the G-Code field identifier “G-NODE” and the data field 2536 of this Gee contains information that defines the characteristics of a subsequent Gee 2632 within the Gee-group 2627-2629. The data field 2536 of the first Gee in each G-group 2627-2629 further contains information that determines the ordering of the Gee-groups 2627-2629 with respect to one another. Some of the information typically found in the data field 2536 of the first Gee in each G-group 2627-2629 includes: a G-NODE reference 2635 that relates the current G-group with a file associated with a G-node at a particular index (“67” in the illustration) in the G-table 2530; the extent value 2542 that defines the size of each file logical block 2610 in terms of disk logical blocks 2615; and a root identifier 2637 that indicates if the G-group is the first G-group in the G-string. Of a plurality of G-NODE Gees 2630, 2640, 2650, only the first Gee 2630 contains an indication that it is a Root Gee, meaning that it is the first Gee of the Gee-string 2620.

[0446] Following the G-NODE Gee in a Gee-group are Gees representing one or more distributed parity groups 2655-2658. A distributed parity group is a set of one or more contiguous DATA Gees followed by an associated PARITY Gee. A DATA Gee is a Gee with the G-code 2534 of “DATA” that lists disk logical block(s) where a file logical block is stored. For example, in FIG. 29, the Gees with indexes of 46-48, 50-52, 77-78 and 89-90 are all DATA Gees, and each is associated with one file logical block 2610.

[0447] A PARITY Gee is a Gee with the G-code 2534 of “PARITY.” Each PARITY Gee lists disk logical block location(s) for a special type of file logical block that contains redundant parity data used for error checking and error correcting one or more associated file logical blocks 2610. A PARITY Gee is associated with the contiguous DATA Gees that immediately precede the PARITY Gee. The sets of contiguous DATA Gees and the PARITY Gees that follow them are known collectively as distributed parity groups 2655-2658.

[0448] For example, in FIG. 29, the PARITY Gee at index 49 is associated with the DATA Gees at indexes 46-48, and together they form the distributed parity group 2655. Similarly, the PARITY Gee at index 53 is associated with the DATA Gees at indexes 50-52, and together they form the distributed parity group 2656. The PARITY Gee at index 79 is associated with the DATA Gees at indexes 77-78, which together form the distributed parity group 2657, and the PARITY Gee at index 91 is associated with the DATA Gees at indexes 89-90, which together form the distributed parity group 2658.

[0449] The size of a disk logical block cluster described by a DATA Gee or a PARITY Gee matches the extent listed in the previous G-NODE Gee. In the example of FIG. 29, the G-NODE Gee 2630 of the first Gee-group 2627 defines an extent size of 2, and each DATA and PARITY Gee of the two distributed parity groups 2655, 2656 of the Gee-group 2627 lists two disk logical block locations. Similarly, G-NODE Gee 2640 of the second Gee-group 2628 defines an extent size of 3, and each DATA and PARITY Gee of the Gee-group 2628 lists three disk logical block locations. G-NODE Gee 2650 of the third Gee-group 2629 defines an extent size of 3, and each DATA and PARITY Gee of the Gee-group 2629 lists three disk logical block locations.

[0450] If a Gee-group is not the last Gee-group in its Gee-string, then a mechanism exists to link the last Gee in the Gee-group to the next Gee-group of the Gee-string using the logical linkages 2617. LINK Gees 2660, 2661 both have the G-code 2534 of “LINK” and a listing in their respective Data fields 2536 that provides the index of the next Gee-group of the Gee-string 2620. For example, the Gee with an index of 54 is the last Gee of Gee-group 2627, and its Data field 2536 includes the starting index “76” of the next Gee-group 2628 of the Gee-string 2620. The Gee with an index of 80 is the last Gee of Gee-group 2628, and its Data field 2536 includes the starting index “88” of the next Gee-group 2629 of the Gee-string 2620. Since the Gee-group 2629 does not include a LINK Gee, it is understood that Gee-group 2629 is the last Gee-group of the Gee-string 2620.
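The traversal of a Gee-string through its LINK Gees can be illustrated with a small sketch. The table below reproduces the indexes given for FIG. 29 (DATA Gees 46-48, 50-52, 77-78, 89-90; PARITY Gees 49, 53, 79, 91; LINK Gees 54 and 80). The tuple representation, the function name `walk_gee_string`, and the placement of the root G-NODE Gee at index 45 are assumptions made for the example; the text states only the starting indexes 76 and 88 of the later Gee-groups.

```python
def build_fig29_table():
    """Return {index: (g_code, data)} for the Gee-string discussed with FIG. 29."""
    table = {}
    table[45] = ("G-NODE", 2)            # data = extent; index 45 is assumed
    for i in (46, 47, 48, 50, 51, 52):
        table[i] = ("DATA", None)
    table[49] = ("PARITY", None)
    table[53] = ("PARITY", None)
    table[54] = ("LINK", 76)             # data = index of the next Gee-group
    table[76] = ("G-NODE", 3)
    table[77] = ("DATA", None)
    table[78] = ("DATA", None)
    table[79] = ("PARITY", None)
    table[80] = ("LINK", 88)
    table[88] = ("G-NODE", 3)
    table[89] = ("DATA", None)
    table[90] = ("DATA", None)
    table[91] = ("PARITY", None)
    return table


def walk_gee_string(table, root_index):
    """Follow LINK Gees from the root G-NODE and list each distributed parity group."""
    groups = []
    index = root_index
    while index is not None:
        code, extent = table[index]
        if code != "G-NODE":
            raise ValueError(f"Gee {index} does not head a Gee-group")
        index += 1
        pending_data = []
        next_group = None
        while index in table:
            code, data = table[index]
            if code == "DATA":
                pending_data.append(index)
            elif code == "PARITY":
                # A PARITY Gee closes the run of DATA Gees that precede it.
                groups.append({"extent": extent, "data": pending_data, "parity": index})
                pending_data = []
            elif code == "LINK":
                next_group = data
                break
            else:
                break
            index += 1
        index = next_group   # None when the Gee-group has no LINK Gee
    return groups


parity_groups = walk_gee_string(build_fig29_table(), 45)
for group in parity_groups:
    print(group)
```

The walk ends at the third Gee-group because no LINK Gee follows the PARITY Gee at index 91, matching the rule that the absence of a LINK Gee marks the last Gee-group of the Gee-string.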

[0451] As previously indicated, the G-code 2534 of “FREE” (not shown in FIG. 29) indicates that the Gee has never yet been allocated and has not been associated with any disk logical location(s) for storing a file logical block. The G-code 2534 of “AVAIL” (not shown in FIG. 29) indicates that the Gee has been previously allocated to a cluster of disk logical block(s) for storing a file logical block, but that the Gee is now free to accept a new assignment. Two situations in which a Gee is assigned the G-code of “AVAIL” are: after the deletion of the associated file logical block; and after transfer of the file to another server in order to optimize load balance for the distributed file storage system 100.

[0452] FIG. 30 illustrates a fault recovery mechanism 700 used by the distributed file storage system 100 to maintain data consistency and integrity when a data fault occurs. Data faults are characterized by corruption or loss of data or information stored in one or more logical blocks 2330 of the array 140. Each data fault can be further characterized as a catastrophic event, where an entire disk 2305 fails, requiring all data on the failed disk to be reconstructed. Alternatively, the data fault can be characterized as a localized event, where the disk 2305 maintains operability but one or more physical disk sectors or logical blocks become corrupted or damaged. In either instance of the data fault, the distributed file storage system 100 uses a fault-tolerant restoration process to maintain data integrity.

[0453] FIG. 30 illustrates one embodiment of a fault-tolerant restoration process used to maintain data integrity in the distributed file storage system 100. As an example of how the process operates, a loss of integrity in a data block for a single parity group is shown. It will be appreciated that this loss of integrity and subsequent recovery methodology can be applied to instances of both complete drive failure and localized data corruption. Thus, the restoration of information contained in a plurality of logical blocks can be accomplished using this process (e.g., restoring all data stored on a failed disk). Additionally, in instances where parity blocks become corrupted or lost, the information from each parity block can be restored in a similar manner to the restoration process for data blocks using the remaining non-corrupted blocks of the parity group.

[0454] In the illustrated embodiment the parity group 2335 includes two data blocks “DATA₁1” and “DATA₁2” and an associated parity block “PARITY₁1-2”, which are shown stored on “DISK 2”, “DISK 8”, and “DISK 11” respectively. Knowledge of the logical disk addresses for each of these blocks is maintained by the server 130 using the aforementioned G-table 2530. As previously discussed, the G-table maintains mapping and structural information for each parity group defined by the plurality of Gees 2538. The Gees further contain information including: the file descriptor associated with the blocks of the parity group 2335, the size and extent of the blocks of the parity group 2335, and the mapping to the logical disk space for each block of the parity group 2335. During routine operation, the server accesses data in the disks of the array using the G-table 2530 to determine the appropriate logical disk blocks to access.

[0455] As shown in FIG. 30, a complete disk failure is exemplified where a loss of data integrity 3072 results in the logical blocks on “DISK 8” becoming inaccessible or corrupted. During the fault-tolerant restoration process the server 130 determines that the data block “DATA₁2” is among the one or more blocks that must be recovered 3074. Using conventional data/parity block recovery methods, the server 130 recovers the compromised data block “DATA₁2” using the remaining blocks “DATA₁1” and “PARITY₁1-2” of the associated parity group 2335. The recovered data block “DATA₁2-REC” is then stored to the disk array 140 and contains the identical information that was originally contained in “DATA₁2”. Using the existing G-table mapping as a reference, the server 130 identifies a new region of disk space that is available for storing the recovered data block and writes the information contained in “DATA₁2-REC” to this region. In one embodiment, space for a new parity group is allocated and the reconstructed parity group is stored in the new space. In another embodiment, the “old” parity group having 1 parity block and N data blocks, where one data block is bad, is entered onto the free list as a parity group having N-1 data blocks. The server 130 further updates the G-table 2530 to reflect the change in logical disk mapping (if any) of the recovered data block “DATA₁2-REC” to preserve file and data integrity in the disk array 140.
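The recovery and remapping just described can be sketched under one explicit assumption: that the parity block is the bytewise XOR of the data blocks, which is a common scheme but is not stated in the text (the text refers only to conventional data/parity block recovery methods). The block names, the dictionary standing in for the G-table mapping, the disk addresses, and the function names `xor_blocks` and `recover_block` are hypothetical.

```python
def xor_blocks(*blocks):
    """Bytewise XOR of equally sized blocks (assumed parity function)."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("blocks must be non-empty and equally sized")
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def recover_block(lost, mapping, disks, target_disk, target_address):
    """Rebuild one lost block from the survivors and remap it to new disk space."""
    survivors = [disks[disk][address]
                 for name, (disk, address) in mapping.items() if name != lost]
    rebuilt = xor_blocks(*survivors)
    disks[target_disk][target_address] = rebuilt
    mapping[lost] = (target_disk, target_address)   # stands in for the G-table update
    return rebuilt


data_1, data_2 = b"hello wo", b"rld 1234"
parity = xor_blocks(data_1, data_2)

# Simplified stand-in for the G-table: block name -> (disk, address).
mapping = {"DATA-1": ("DISK 2", 17), "DATA-2": ("DISK 8", 5), "PARITY-1-2": ("DISK 11", 9)}
disks = {"DISK 2": {17: data_1}, "DISK 8": {5: data_2}, "DISK 11": {9: parity},
         "DISK 3": {}, "DISK 5": {}}

del disks["DISK 8"]                                  # first failure
rebuilt_2 = recover_block("DATA-2", mapping, disks, "DISK 3", 40)

del disks["DISK 2"]                                  # later failure in the same group
rebuilt_1 = recover_block("DATA-1", mapping, disks, "DISK 5", 7)

print(rebuilt_2 == data_2, rebuilt_1 == data_1, mapping)
```

Because the first recovered block is written to other available disk space and remapped, the parity group is whole again before the second failure, so the second block can also be rebuilt.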

[0456] One desirable feature of the distributed file storage system 100 is that the recovered data block need not be restored to the same logical disk address on the same disk where the data failure occurred. For example, the recovered data block “DATA₁2-REC” can be stored to “DISK 3” and the G-table updated to reflect this change in block position. An important benefit resulting from this flexibility in data recovery is that the disk array 140 can recover and redistribute data from a failed drive across other available space within the disk array 140. Therefore, a portion of a disk or even an entire disk can be lost in the distributed file storage system 100 and the data contained therein can be recovered and moved to other locations in the disk array 140. Upon restoring the data to other available disk space, the server 130 restores the integrity of the parity group 2335, resulting in the preservation of fault-tolerance through multiple losses in data integrity even within the same parity group without the need for immediate repair or replacement of the faulted drive to restore fault-tolerance.

[0457] As an example of the preservation of fault tolerance through more than one data fault, a second drive failure 3076 is shown to occur on “DISK 2” and affects the same parity group 2335. This disk failure occurs subsequent to the previous disk failure in which “DISK 8” is illustrated as non-operational. The second disk failure further results in the loss of data integrity for the block “DATA₁1”. Using a method of data recovery similar to that described above, the information contained in the data block “DATA₁1” can be recovered and redistributed 3078 to another logical address within the disk array 140. The recovered data block “DATA₁1-REC” is illustrated as being saved to available disk space located on “DISK 5” and is stored in a disk region free of corruption or data fault. Thus, fault tolerance is preserved by continuous data restoration and storage in available non-corrupted disk space.

[0458] The fault tolerant data recovery process demonstrates an example of how the distributed file storage system 100 handles data errors or corruption in the disk array 140. An important distinction between this system 100 and conventional storage systems is that the aforementioned data recovery process can automatically redistribute data or parity blocks in a dynamic and adaptable manner. Using the block redistribution processes described above results in the distributed file storage system 100 having a greater degree of fault-tolerance compared to conventional storage systems. In one aspect, the increase in fault tolerance results from the system's ability to continue normal operation even when one or more drives experience a data loss or become inoperable.

[0459] In conventional storage systems, when a single disk failure occurs, the storage system's fault tolerant characteristics are compromised until the drive can be repaired or replaced. The lack of ability of conventional systems to redistribute data stored on the faulted drive to other regions of the array is one reason for their limited fault tolerance. In these conventional systems, the occurrence of a second drive failure (similar to that shown in FIG. 30) will likely result in the loss or corruption of data that was striped across both of the failed drives. The distributed file storage system 100 overcomes this limitation by redistributing the data that was previously stored on the faulted drive to a new disk area and updating the G-table, which stores the mapping information associated with the data, to reflect its new position. As a result, the distributed file storage system 100 is rendered less susceptible to sequential drive faults even if they occur within the same parity group. Thus, the process of recovery and redistribution restores the fault-tolerant characteristics of the distributed file storage system 100 and beneficially accommodates further drive failures within the array 140.

[0460] Another feature of the distributed file storage system 100 relates to the flexible placement of recovered data. In one aspect, a recovered data block may be stored anywhere in the DFSS through a modification of the parity group associated with the data. It will be appreciated that placement of recovered data in this manner is relatively simple and efficient, promoting improved performance over conventional systems.

[0461] In one embodiment, this feature of tolerance to multiple disk failures results in an improved “hands-off” or “maintenance-free” data storage system where multiple-drive failures are tolerated. Furthermore, the distributed file storage system 100 can be configured with the anticipation that if data corruption or a drive failure does occur, the system 100 will have enough available space within the array 140 to restore and redistribute the information as necessary. This improved fault tolerance feature of the distributed file storage system 100 reduces maintenance requirements associated with replacing or repairing drives within the array. Additionally, the mean time between failures (MTBF) characteristics of the system 100 are improved as the system 100 has reduced susceptibility to sequential drive failure or data corruption.

[0462] In one embodiment the distributed file storage system is desirably configured to operate in a “hands-off” environment where the disk array incorporates additional space to be tolerant of periodic data corruption or drive failures without the need for maintenance for such occurrences. Configuration of the system 100 in this manner can be more convenient and economical for a number of reasons such as: reduced future maintenance costs, reduced concern for replacement drive availability, and reduced downtime required for maintenance.

[0463] In one aspect, the fact that parity groups may be integrated with the file metadata provides a way for prioritizing recovery of the data. For example, when some file or set of files is designated as highly important, or is frequently accessed, a background recovery process can be performed on those designated files first. In the case where the file is frequently accessed, this feature may improve system performance by avoiding the need for time-consuming on-demand regeneration when a client attempts to access the file. In the case where the file is highly important, this feature reduces the amount of time where a second drive failure might cause unrecoverable data loss.
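One way to picture prioritized background recovery is a priority queue over the affected files. The sketch below is an assumption-laden illustration: the ranking policy (files designated important first, then the most frequently accessed), the file names, the access counts, and the function name `recovery_order` are all hypothetical, since the text does not prescribe how priorities are combined.

```python
import heapq


def recovery_order(files):
    """Return file names in the order a background recovery pass would visit them.

    files: iterable of (name, is_important, accesses_per_day).
    """
    heap = []
    for sequence, (name, is_important, accesses) in enumerate(files):
        # heapq is a min-heap: False sorts before True, so `not is_important`
        # puts important files first; negating accesses puts busy files first.
        # `sequence` breaks ties in arrival order.
        heapq.heappush(heap, (not is_important, -accesses, sequence, name))
    ordered = []
    while heap:
        ordered.append(heapq.heappop(heap)[-1])
    return ordered


affected = [
    ("logs.txt", False, 3),
    ("db.idx", True, 50),
    ("video.bin", False, 900),
    ("ledger.dat", True, 2),
]
order = recovery_order(affected)
print(order)
```

Recovering the designated files first shortens the window in which a second drive failure could make them unrecoverable, and recovering busy files early avoids on-demand regeneration when clients read them.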

[0464] FIG. 31 illustrates one embodiment of a method 3172 for recovering corrupted or lost data resulting from one or more data faults. As discussed above and shown in the previous figure, data corruption can occur as a result of a complete drive failure, or data corruption can be localized and affect only a limited subset of logical storage blocks within the array. The distributed storage system identifies the presence of data corruption in a number of ways. In one aspect, the server recognizes corrupted data during storage or retrieval operations in which one or more of the disks of the array are accessed. These operations employ error checking routines that verify the integrity of the data being stored to or retrieved from the array. These error checking routines typically determine checksum values for the data while performing the read/write operation to ensure that the data has been stored or retrieved in a non-corrupted manner. In cases where the read/write operation fails to generate a valid checksum value, the read/write operation may be repeated to determine whether the error was spurious in nature (oftentimes due to cable noise or the like) or due to a hard error where the logical disk space where the data is stored has become corrupted.
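The retry logic described above, which distinguishes a spurious error from a hard error, might look like the following sketch. CRC-32 (via Python's `zlib.crc32`) is used here only as a stand-in checksum; the text does not name a checksum algorithm, and the function name `read_with_verification`, the three-attempt limit, and the status strings are assumptions.

```python
import zlib


def read_with_verification(read_block, expected_crc, max_attempts=3):
    """Read a block, verify its checksum, and retry to classify any error.

    read_block: zero-argument callable returning the bytes read from disk.
    Returns (payload, status) where status is "ok", "spurious", or "hard".
    """
    for attempt in range(1, max_attempts + 1):
        payload = read_block()
        if zlib.crc32(payload) == expected_crc:
            # A clean read after an earlier failure suggests a transient fault.
            return payload, ("ok" if attempt == 1 else "spurious")
    # Every attempt failed verification: treat the disk space as corrupted.
    return None, "hard"


good = b"parity group payload"
bad = b"parity gr0up payload"          # one corrupted byte
crc = zlib.crc32(good)

# Transient fault: the first read is corrupted, the retry succeeds.
responses = iter([bad, good])
flaky_result = read_with_verification(lambda: next(responses), crc)

# Hard fault: every read returns corrupted data.
hard_result = read_with_verification(lambda: bad, crc)

print(flaky_result[1], hard_result[1])
```

A block classified as a hard error would then be handed to the restoration procedure, while a spurious error requires no recovery.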

[0465] Data corruption may further be detected by the server 130 when one or more disks 2305 within the array 140 become inaccessible. Inaccessibility of the disks 2305 can arise for a number of reasons, such as component failure within the drive or wiring malfunction between the drive and the server. In these instances where one or more disks within the array are no longer accessible, the server 130 identifies the data associated with the inaccessible drive(s) as being corrupted or lost and requiring restoration.

[0466] During the identification of the data fault 3175, the number and location of the affected logical blocks within the disk array 140 are determined. For each logical block identified as corrupted or lost, the server 130 determines the parity group associated with the corrupted data 3177. Identification of the associated parity group 2335 allows the server 130 to implement restoration procedures to reconstruct the corrupted data using the non-corrupted data and parity blocks 2330, 2337 within the same parity group 2335. Furthermore, the logical storage block or disk space associated with the corrupted data is identified 3179 in the G-table 2530 to prevent further attempts to use the corrupted disk space.

[0467] In one embodiment, the server 130 identifies the “bad” or corrupted logical blocks mapped within the G-table 2530 and removes the associated Gees from their respective parity groups, thereby making the parity group shorter. Additionally, the server 130 can identify corrupted logical blocks mapped within the G-table 2530 and remap the associated parity groups to exclude the corrupted logical blocks.

[0468] Prior to restoring the information contained in the affected logical blocks, the server 130 determines the number and type of parity groups that are required to contain the data 3180 that will subsequently be restored. This determination 3180 is made by accessing the G-table 2530 and identifying a suitable available region within the disk array 140 based on parity group allocation that can be used to store the reconstructed data. When an available parity group is found, the server 130 updates the G-table 2530 to reflect the location where the reconstructed data will be stored. Additionally, the mapping structure of the array 140 is preserved by updating the links or references contained in Gees 2538 of the G-table 2530 to reflect the position where the reconstructed data will be stored in relation to other parity groups of the parity string. Data is then restored 3181 to the logical disk address pointed to by the updated Gee using the remaining non-corrupted blocks of the parity group to provide the information needed for data restoration.

[0469] As previously discussed, one feature of the distributed file storage system 100 is the use of variable length and/or variable extent parity groups. Unlike conventional storage systems that use only a fixed block size and configuration when storing and striping data to a disk array, the system 100 of the present invention can store data in numerous different configurations defined by the parity group characteristics. In one embodiment, by using a plurality of different parity group configurations, the distributed file storage system 100 can improve the efficiency of data storage and reduce the inefficient use of disk space.

[0470] FIGS. 32A, B illustrate a simplified example of the use of variably sized parity groups to store files with different characteristics. As shown in FIG. 32A, File #1 comprises a 4096 byte string that is stored in the disk array 140. As previously discussed, the server 130 selects space from the plurality of parity groups 2335 having different structural characteristics to store the data contained in File #1. In the illustrated embodiment, 4 exemplary parity strings 3240-3243 are considered for storing File #1. Each of the parity strings 3240-3243 comprises one or more parity groups 2335 that have a designated extent based on a logical disk block size of 512 bytes. The parity groups 2335 of each parity string 3240-3243 are further associated using the G-table 2530, which links the information in the parity groups 2335 to encode the data contained in File #1.

[0471] The first parity string 3240 comprises a single 4-block parity group having 1024-byte data and parity blocks. The first parity string 3240 has a total size, including all data and parity blocks, of 5120 bytes and an extent value of 2. The second parity string 3241 comprises two 3-block parity groups having 1024-byte data and parity blocks. The second parity string 3241 has a total size, including the data and parity blocks, of 8192 bytes and an extent value of 2. The third parity string 3242 comprises four 2-block parity groups having 512-byte data and parity blocks. The third parity string 3242 has a total size, including the data and parity blocks, of 6144 bytes and an extent value of 1. The fourth parity string 3243 comprises eight 1-block parity groups having 512-byte data and parity blocks. The fourth parity string 3243 has a total size, including the data and parity blocks, of 8192 bytes and an extent value of 1.

[0472] Each of the parity strings 3240-3243 represents the minimum number of parity groups 2335 of a particular type or composition that can be used to fully store the information contained in File #1. One reason for the difference in parity group composition results from the different numbers of total bytes required to store the data contained in File #1. The differences in total byte numbers further result from the number and size of the parity blocks 2337 associated with each parity group 2335.

[0473] A utilization value 3245 is shown for each parity string 3240-3243 used to store File #1. The utilization value 3245 is one metric that can be used to measure the relative efficiency of storage of the data of File #1. The utilization value 3245 is determined by the total number of bytes in the parity string 3240-3243 that are used to store the data of File #1 compared to the number of bytes that are not needed to store the data. For example, in the second parity string 3241, one parity group 3247 is completely occupied with data associated with File #1 while another parity group 3246 is only partially utilized. In one aspect, the remainder of space left in this parity group 3246 is unavailable for further data storage due to the composition of the parity group 3246. The utilization value is calculated by dividing the file-occupying or used byte number by the total byte number to determine a percentage representative of how efficiently the data is stored in the parity string 3240-3243. Thus, the utilization values for the first, second, third, and fourth parity strings 3240-3243 are 100%, 66%, 100%, and 100% respectively.
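
The figures quoted for File #1 can be reproduced with a short calculation. The following is a minimal sketch, not the system's implementation: the names GroupType and parity_string_stats are hypothetical, and the sketch assumes that utilization is measured against the data-block capacity of the parity string (parity blocks excluded), which is the reading that yields the 66% value for the second parity string.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupType:
    """Hypothetical description of one parity group configuration."""
    data_blocks: int   # data blocks per parity group (one parity block is implied)
    block_bytes: int   # bytes per block, i.e. extent x 512-byte logical block


def parity_string_stats(file_bytes: int, g: GroupType):
    """Return (groups needed, total bytes incl. parity, utilization of data space)."""
    data_per_group = g.data_blocks * g.block_bytes
    groups = math.ceil(file_bytes / data_per_group)
    total_bytes = groups * (g.data_blocks + 1) * g.block_bytes
    utilization = file_bytes / (groups * data_per_group)
    return groups, total_bytes, utilization


# The four configurations considered for the 4096-byte File #1.
configs = [GroupType(4, 1024), GroupType(3, 1024), GroupType(2, 512), GroupType(1, 512)]
for cfg in configs:
    groups, total, util = parity_string_stats(4096, cfg)
    print(cfg, groups, total, f"{util:.0%}")
```

Under these assumptions the four parity strings need 1, 2, 4, and 8 parity groups and occupy 5120, 8192, 6144, and 8192 bytes respectively.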

[0474] In one embodiment, the server 130 determines how to store data based on the composition of the file and the availability of the different types of parity groups. As shown in FIG. 32A, of the different choices for storing File #1, the first parity string 3240 is most efficient as it has the lowest total bytes required for storage (5120 bytes total) as well as a high utilization value (100%). Each of the other parity strings 3241-3243 is less desirable for storing the data in File #1 due to greater space requirements (larger number of total bytes) and in some cases reduced storage efficiency (lower utilization value).

[0475] FIG. 32B illustrates another simplified example of the use of variably sized parity groups to store files of differing sizes. In the illustrated embodiment the storage characteristics of a plurality of four parity strings 3250-3253 are compared for a small file, File #2, comprising a single 1024 byte string. The parity strings comprise: the first parity string 3250 composed of the single parity group 2335 having 4 data blocks 2330 and 1 parity block 2337, each 1024 bytes in length; the second parity string 3251 composed of the single parity group 2335 having 3 data blocks 2330 and 1 parity block 2337, each 1024 bytes in length; the third parity string 3252 composed of the single parity group 2335 having 2 data blocks 2330 and 1 parity block 2337, each 512 bytes in length; and the fourth parity string 3253 having two parity groups 2335 each composed of the single 512-byte data block 2330 and the parity block 2337.

[0476] When storing the byte pattern contained in File #2, different storage characteristics are obtained for each parity string 3250-3253. For example, the first parity string 3250 is only partially occupied by the data of File #2 resulting in the utilization value 3245 of 25%. Similarly, the second parity string 3251 is also partially occupied resulting in the utilization value 3245 of 33%. Conversely, the third and fourth parity strings 3252-3253 demonstrate complete utilization of the available space in the parity group (100% utilization). Based on the exemplary parity group characteristics given above, the most efficient storage of File #2 is achieved using the third parity string 3252 where a total of 1536 bytes are allocated to the parity string with complete (100%) utilization.

[0477] The aforementioned examples demonstrate how files with differing sizes can be stored in one or more parity group configurations. In each of the above examples, the unused blocks or partially filled blocks remaining in the parity group are “zero-filled” or “one-filled” to complete the formation of the parity group and encode the desired information from the file. Furthermore, by providing a plurality of parity group configurations, improved storage efficiency can be achieved for different file sizes where less space is left unutilized within the disk array 140. It will be appreciated by one of skill in the art that many possible parity group configurations can be formed in a manner similar to those described in FIGS. 32A, B. Examples of characteristics which may influence the parity group configuration include: logical block size, extent, parity group size, and parity group number, among other characteristics of the distributed file storage system 100. Therefore, each of the possible variations in parity group characteristics and distribution should be considered but other embodiments of the present invention.

[0478] Typically, one or more selected parity groups of the available configurations of parity groups provide improved storage efficiency for particular file types. Therefore, in order to maintain storage efficiency across each different file configuration, a plurality of parity group configurations are desirably maintained by the server. One feature of the distributed file storage system 100 is to identify desirable parity group configurations based on individual file characteristics that lead to improved efficiency in data storage.

[0479] FIG. 33 illustrates one embodiment of a data storage process 3360 used by the distributed file storage system 100 to store data. This process 3360 desirably improves the efficiency of storing data to the disk array 140 by selecting parity group configurations that have improved utilization characteristics and reduce unused or lost space. In this process 3360 the server 130 receives files 3361 from the clients 110 that are to be stored in the disk array 140. The server 130 then assesses the file's characteristics 3363 to determine suitable parity string configurations that can be used to encode the information contained in the file. During the file assessment 3363, the server 130 can identify characteristics such as the size of the file, the nature of the data contained in the file, the relationship of the file to other files presently stored in the disk array, and other characteristics that are used to determine how the file will be stored in the disk array 140. Using the G-table 2530 as a reference, the server 130 then identifies 3365 available (free) parity groups that can be used to store the file to the disk array 140.

[0480] Typically, a plurality of parity group configurations are available and contain the requisite amount of space for storing the file. Using an analysis methodology similar to that described in FIGS. 32A, B, the server 130 assesses the utilization characteristics for each parity group configuration that can be used to store the file. Based on the available configurations and their relative storage efficiency, the server 130 selects a desirable parity group configuration 3367 to be used for file storage. In one embodiment, a desirable parity group configuration is identified on the basis of the high utilization value 3245 that is indicative of little or no wasted space (non-file encoding space) within the parity groups. Furthermore, a desirable parity group configuration stores the file in the parity string 2335 comprising the fewest total bytes. Using these two parameters as a metric, the server 130 selects the desirable parity group configuration 3367 and stores the data contained in the file 3369. During file storage 3369, the G-table 2530 is updated to indicate how the file is mapped to the disk array 140, and characteristics of the G-string 2530 used to store the file are encoded in the appropriate Gees of the G-table 2530. Furthermore, the one or more Gees corresponding to the logical disk blocks where the data from the file is stored are updated to reflect their now occupied status (i.e., removed from the pool of available or free disk space).
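
The selection step 3367 can be sketched as a ranking over the configurations that still have enough free parity groups. This is an illustrative sketch only: the function name choose_configuration, the dictionary layout, and the tie-breaking order (highest utilization first, then fewest total bytes) are assumptions made for the example and are not prescribed by the process 3360.

```python
import math


def choose_configuration(file_bytes, free_groups):
    """Pick a parity group configuration for a file of file_bytes (> 0).

    free_groups maps (data_blocks, block_bytes) -> number of free parity groups.
    Returns ((data_blocks, block_bytes), groups_needed), or None if nothing fits.
    """
    best = None
    for (data_blocks, block_bytes), free in free_groups.items():
        capacity = data_blocks * block_bytes          # data bytes per group
        needed = math.ceil(file_bytes / capacity)     # groups in the parity string
        if needed > free:
            continue                                  # not enough free groups of this type
        total = needed * (data_blocks + 1) * block_bytes
        utilization = file_bytes / (needed * capacity)
        key = (-utilization, total)                   # high utilization, then few bytes
        if best is None or key < best[0]:
            best = (key, (data_blocks, block_bytes), needed)
    return None if best is None else best[1:]


free = {(4, 1024): 10, (3, 1024): 10, (2, 512): 10, (1, 512): 10}
print(choose_configuration(4096, free))   # File #1 of FIG. 32A
print(choose_configuration(1024, free))   # File #2 of FIG. 32B
```

With all four types available, the sketch selects the single 4-block, 1024-byte group for the 4096-byte file and the single 2-block, 512-byte group for the 1024-byte file, in agreement with FIGS. 32A, B.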

[0481] In another embodiment the distributed file storage system 100 provides a flexible method for redistributing the parity groups 2335 of the disk array 140. As discussed previously, prior to storage of information in the disk array 140 the distributed file storage system 100 creates the G-table 2530 containing a complete map of the logical blocks of each disk 2305 of the disk array 140. Each logical block is allocated to a particular parity group type and may be subsequently accessed during data storage processes when the group type is requested for data storage. During initialization of the disk array 140, the server 130 allocates all available disk space to parity groups 2335 of various lengths or sizes which are subsequently used to store data and information. As files are stored to the disk array 140, the parity groups 2335 are accessed as determined by the server 130 and the availability of each parity group type changes.

[0482] Using the plurality of different sizes and configurations of parity groups 2335 allows the server 130 to select particular parity group configurations whose characteristics permit the storage of a wide variety of file types with increased efficiency. In instances where a file is larger than the largest available parity group, the server 130 can break down the file and distribute its contents across multiple parity groups. The G-table 2530 maps the breakdown of file information across the parity groups over which it is distributed and is used by the server 130 to determine the order in which the parity groups should be accessed to reconstruct the file. Using this method, the server 130 can accommodate virtually any file size and efficiently store its information within the disk array 140.

[0483] When a large quantity of structurally similar data is stored to the disk array 140, a preferential parity group length can be associated with the data due to its size or other characteristics. The resulting storage in the preferential parity group length reduces the availability of this particular parity group and may exhaust the supply allocated by the server 130. Additionally, other parity group lengths can become underutilized, as the data stored to the disk array 140 does not utilize these other parity group types in a balanced manner. In one embodiment the distributed file storage system 100 monitors the parity set distribution and occupation characteristics within the disk array 140 and can alter the initial parity set distribution to meet the needs of client data storage requests on an ongoing basis and to maintain a balanced distribution of available parity group types. The parity group monitoring process can further be performed as a background process or thread to maintain data throughput and reduce administrative overhead in the system 100.

[0484] FIGS. 34A-C illustrate a simplified parity set redistribution process useful in maintaining availability of parity groups 2335 within the disk array 140. Redistribution is handled by the server 130, which can update sets of Gees of the G-table 2530 to alter their association with a first parity group into an association with a second parity group. Furthermore, other characteristics of the data and parity blocks within a parity group can be modified, for example, to change the size or extent of each block. By updating the G-table 2530, the server 130 provides a parity group balancing functionality to ensure that each type or configuration of parity group is available within the disk array 140.

[0485] FIG. 34A illustrates an exemplary parity group distribution for the disk array 140 prior to storage of data from clients 110. The parity group distribution comprises four types of parity groups corresponding to a 4-block parity group 3480, a 3-block parity group 3481, a 2-block parity group 3482, and a 1-block parity group 3483. In configuring the distributed file storage system 100 there is an initial allocation 3491 of each type of parity group 3480-3483. For example, in the illustrated embodiment, 10000 groups are allocated for each type of parity group 3480-3483. Each parity group 3480-3483 further occupies a calculable percentage of a total disk space 3485 within the disk array 140 based on the size of the parity group. Although the parity group distribution is illustrated as containing four types of parity groups, it will be appreciated by one of skill in the art that numerous other sizes and configurations of parity groups are possible (e.g., 8, 10, 16, etc.). In one embodiment, the number of blocks within the parity group 2335 can be any number less than or equal to the number of disks within the disk array 140. Furthermore, the parity groups 2335 may be distributed across more than one disk array 140, thus allowing for even larger parity group block numbers that are not limited by the total number of disks within the single disk array 140.

[0486] As disk usage occurs 3487, parity groups 3480-3483 become occupied with data 3490 and, of the total initial allocation of parity groups 3491, a lesser amount remain as free or available parity groups 3492. FIG. 34B illustrates parity group data occupation statistics where, of the initially allocated parity groups 3491 for each parity type, a fraction remain as free or available 3492 for data storage. More specifically: the occupation statistics for the 4-block parity group comprise 2500 free vs. 7500 occupied parity groups, the occupation statistics for the 3-block parity group comprise 7500 free vs. 2500 occupied parity groups, the occupation statistics for the 2-block parity group comprise 3500 free vs. 6500 occupied parity groups, and the occupation statistics for the 1-block parity group comprise 500 free vs. 9500 occupied parity groups.

[0487] During operation of the distributed file storage system 100, free parity groups can become unevenly distributed such that there is a greater proportion of free parity groups in one parity group length and a lesser proportion of free parity groups in another parity group length. While this disparity in distribution does not necessarily impact the performance or effectiveness of storing data to the disk array 140, the server 130 monitors the availability of each parity group 3480-3483 to ensure that no single parity group type becomes completely depleted. Depletion of a parity group is undesirable as it reduces the choices available to the server 130 for storing data and can potentially affect the efficiency of data storage. As shown in FIG. 34B, the 3-block parity group 3481 possesses a greater number of free parity groups 3492 compared to any of the other parity groups 3480, 3482, 3483, while the 1-block parity group 3483 possesses the smallest number of free parity groups and may be subject to complete depletion should data storage continue with similar parity group distribution characteristics.

[0488] To prevent parity group depletion, the server 130 can redistribute or convert 3494 at least a portion of one parity group into other parity group lengths. As shown in FIG. 34C, the server 130 converts a portion of the 3-block parity group 3481 into the 1-block parity group 3483. The resulting conversion redistributes the number of parity groups within the disk array 140 by reducing the number of parity groups of a first parity group type (3-block parity) and generating an additional quantity of parity groups of the second parity group type (1-block parity). Redistribution in this manner beneficially prevents the complete depletion of any parity group and thus preserves the efficiency of data storage by ensuring that each parity group is available for data storage.

[0489] In one embodiment, parity group redistribution is performed by updating one or more Gees of the G-table 2530 to reflect new parity group associations. As previously discussed, each parity group 2335 is assigned using a data structure linking associated Gees. The redistribution process updates these data structures to redefine the parity group associations for the logical blocks of the disk array 140. Thus, the server 130 can rapidly perform parity group redistribution without affecting existing occupied parity groups or significantly degrading the performance of the distributed file storage system 100.

[0490] FIGS. 35A, B illustrate two types of parity group redistribution processes 3500 that are used by the system 100 to maintain parity group availability in the disk array 140. A first redistribution process known as parity group dissolution 3510 converts a larger parity group into one or more smaller parity groups. As shown in FIG. 35A, a 5-block parity group 3515 can be converted into two smaller parity groups consisting of a 1-block parity group 3520 and a 3-block parity group 3525. The 5-block parity group 3515 can also be converted into two 2-block parity groups 3530 or alternatively three 1-block parity groups 3520.

[0491] A second redistribution process 3500 known as parity group consolidation 3535 (shown in FIG. 35B) converts two or more smaller parity groups into one or more larger parity groups. For example, two 2-block parity groups 3530 can be combined to form the single 5-block parity group 3515. Alternatively, the two 2-block parity groups 3530 can be combined to form a 3-block parity group 3525 and a 1-block parity group 3520.
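
The dissolution and consolidation examples above share one invariant: an n-block parity group occupies n data blocks plus one parity block, and the total number of logical blocks is unchanged by the conversion. The sketch below enumerates every dissolution of a group under that invariant and checks a proposed conversion against it. The function names are hypothetical, and the one-parity-block-per-group accounting is taken from the parity group description rather than from any stated algorithm.

```python
def dissolutions(n):
    """Return all ways to split an n-block parity group into smaller groups.

    Each result is a sorted list of group sizes (data blocks per group) whose
    combined data + parity blocks equal the n + 1 blocks of the original group.
    """
    def parts(remaining, smallest):
        # Partitions of `remaining` into non-decreasing parts, each >= smallest.
        if remaining == 0:
            yield []
            return
        for p in range(smallest, remaining + 1):
            for rest in parts(remaining - p, p):
                yield [p] + rest

    # A group of k data blocks occupies k + 1 blocks, so the smallest part is 2.
    return [[p - 1 for p in combo] for combo in parts(n + 1, 2) if len(combo) > 1]


def conserves_blocks(source_sizes, target_sizes):
    """True if a conversion uses exactly the logical blocks it started with."""
    return sum(s + 1 for s in source_sizes) == sum(t + 1 for t in target_sizes)


print(dissolutions(5))                    # the three options of FIG. 35A
print(conserves_blocks([2, 2], [5]))      # consolidation of FIG. 35B
print(conserves_blocks([2, 2], [3, 1]))   # alternative consolidation
```

For a 5-block parity group the enumeration yields exactly the three dissolutions described for FIG. 35A: three 1-block groups, a 1-block group with a 3-block group, and two 2-block groups.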

[0492] It will be appreciated that numerous combinations of parity group dissolution 3510 and consolidation 3535 exist. These redistribution processes 3500 are advantageously used to modify the existing parity group configurations to accommodate the demands of the system 100 as it is populated with information. Using these processes 3500 improves the performance and efficiency of storing data in the system 100. Consistency and knowledge of the parity group distribution is maintained using the G-table 2530, which is updated as the modifications to the parity groups are made. These processes 3500 can further be performed using both occupied and unoccupied parity groups or a combination thereof to further improve the flexibility of the distributed storage system 100.

[0493] FIG. 36 illustrates a process 3600 used by the server 130 to monitor parity group availability and perform parity group redistribution as needed. This process 3600 is important in maintaining a desirable quantity of each type of parity group so that files can be stored with improved storage efficiency. In the illustrated embodiment, the process 3600 commences with a monitoring function that determines parity group availability 3602. The monitoring function 3602 can be performed continuously or at periodic time intervals to ensure available parity groups remain balanced within the disk array 140. Using the G-table 2530 as a reference, the monitoring function 3602 rapidly assesses the current status of data occupation within the array 140. More specifically, the monitoring function 3602 can determine the availability of each type of parity group and determine the number of free or available groups using the mapping information of the G-table 2530.

[0494] As a particular type of parity group is depleted 3604, indicated by a reduction in the number of free parity groups for the particular group type, the server 130 proceeds to assess the parity group statistics 3606 for each parity group defined within the G-table 2530. The assessment of parity group statistics 3606 comprises determining both the free and available parity group statistics using the G-table 2530 as a reference. In determining how to increase the quantity of free parity groups for a depleted parity group type, the server 130 assesses which other parity groups contain available or free parity groups that have not been used to store data. This assessment is made based upon the parity group usage statistics which, for example, indicate free parity groups, occupied parity groups, disk space occupation, and frequency of access or utilization, among other statistics that can be collected while the distributed file storage system 100 is in operation.

[0495] In one embodiment, the server 130 continually collects and stores usage statistics so as to provide up-to-date and readily available statistical information that can be used to determine how redistribution of available parity groups should proceed. Additionally, these statistics can be acquired from the G-table 2530 where the server 130 calculates the usage statistics based upon the current contents of the G-table 2530.

[0496] Upon acquiring the parity group statistics 3606, the server 130 calculates a suitable re-distribution 3608 of the parity groups. The re-distribution 3608 desirably takes into account factors such as, for example, the number and type of parity groups 2335 within the disk array 140, the availability of unoccupied parity groups within each parity group type, and the frequency of usage or access of each parity group type, among other considerations that can be determined using the parity group statistics. During parity group redistribution 3608, one or more different parity groups can be used as a source for supplementing the depleted parity group set. The overall effect of redistribution 3608 is to balance the free or available parity groups of each type so that no single parity group type is depleted.
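
One way to picture the monitoring and redistribution steps 3602-3608 is a low-water-mark policy applied to the free counts of each parity group type. The sketch below is an assumption-laden illustration: the threshold low_water, the choice of the type with the most free groups as donor, and the function name rebalance are all hypothetical, and the actual re-distribution 3608 may weigh additional statistics such as access frequency.

```python
import math


def rebalance(free, low_water):
    """Top up depleted parity group types from the type with the most free groups.

    free maps group size (data blocks per group) -> count of free groups.
    Returns (new free counts, logical blocks left over after conversion).
    """
    free = dict(free)
    spare_blocks = 0
    for size in sorted(free):
        shortfall = low_water - free[size]
        if shortfall <= 0:
            continue                                  # this type is not depleted
        donor = max(free, key=free.get)               # type with the most free groups
        per_donor = (donor + 1) // (size + 1)         # target groups per donor group
        if donor == size or per_donor == 0:
            continue                                  # dissolution cannot help here
        donors_needed = math.ceil(shortfall / per_donor)
        donors_used = min(donors_needed, free[donor] - low_water)
        if donors_used <= 0:
            continue                                  # donor is itself near depletion
        free[donor] -= donors_used
        free[size] += donors_used * per_donor
        spare_blocks += donors_used * ((donor + 1) % (size + 1))
    return free, spare_blocks


# Free counts of FIG. 34B: the 1-block type is close to depletion.
print(rebalance({4: 2500, 3: 7500, 2: 3500, 1: 500}, low_water=1000))
```

Applied to the free counts of FIG. 34B with a hypothetical threshold of 1000, the sketch dissolves 250 free 3-block groups into 500 additional 1-block groups, which mirrors the 3-block to 1-block conversion of FIG. 34C.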

[0497] Parity group redistribution in the aforementioned manner is facilitated by the use of the G-table 2530 mapping structure. Using the G-table 2530, parity groups can be readily assigned and re-assigned without significant overhead by modifying the contents of appropriate Gees. This method of disk space allocation represents a significant improvement over conventional disk storage methods such as those used in RAID architectures. In conventional RAID architectures, the rigid nature of disk space allocation prevents optimizing data storage in the manner described herein. Furthermore, the parity group redistribution feature of the distributed file storage system 100 provides an effective method to monitor and maintain optimized disk storage characteristics within the array to ensure efficient use of available disk space.

[0498] In addition to redistributing free or available space within the disk array 140, the distributed file storage system 100 also features a method by which occupied parity groups can be modified and re-configured into other parity group types. One benefit realized by re-configuring occupied parity groups is that unnecessary space allocated to a particular parity group in which data is stored may be reclaimed for use and converted to available or free storage space. Furthermore, re-configuration of occupied parity groups can be used to de-fragment or consolidate the information stored in the disk array 140, enabling the information contained therein to be accessed more efficiently.

[0499] FIG. 37 illustrates one embodiment of a parity group optimization/de-fragmentation routine used to re-configure data within the disk array 140. Parity group occupation statistics are shown for different parity lengths including: a 1-block parity group having 2800 free parity groups and 7200 occupied parity groups, a 2-block parity group having 1800 free parity groups and 8200 occupied parity groups, a 3-block parity group having 800 free parity groups and 9200 occupied parity groups, and a 4-block parity group having 2300 free parity groups and 7700 occupied parity groups.

[0500] When the server 130 performs an optimization routine 3785, one or more of the parity groups can be re-configured into another type of parity group. For example, as shown in the illustration, a portion of the 1-block parity groups corresponding to 3200 groups can be consolidated into 800 groups of 4-block parity. In the consolidated parity groups, the original information contained in the 1-block parity group is retained in a more compact form in the 4-block parity groups. The resulting 4-block parity groups require less parity information to maintain data integrity compared to an equivalent quantity of information stored in a 1-block parity configuration. In the illustrated embodiment, the residual space left over from the optimization routine corresponds to approximately 1200 groups of 1-block parity and can be readily converted into any desirable type of parity group using G-table updating methods.
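
The space reclaimed by consolidation follows from block counting: 3200 one-block groups hold 3200 data blocks in 6400 logical blocks, and the same data fits in 800 four-block groups occupying 4000 logical blocks, leaving 2400 blocks, or 1200 one-block groups, as residual space. A minimal sketch of that arithmetic follows; the function name consolidate is hypothetical, and the sketch assumes every source group is fully occupied and one parity block per group.

```python
def consolidate(groups, src_size, dst_size):
    """Repack fully occupied groups of src_size data blocks into dst_size groups.

    Returns (destination groups needed, logical blocks reclaimed).
    """
    data_blocks = groups * src_size
    dst_groups = -(-data_blocks // dst_size)          # ceiling division
    blocks_before = groups * (src_size + 1)           # data + parity before
    blocks_after = dst_groups * (dst_size + 1)        # data + parity after
    return dst_groups, blocks_before - blocks_after


dst_groups, reclaimed = consolidate(3200, src_size=1, dst_size=4)
print(dst_groups, reclaimed, reclaimed // 2)   # reclaimed blocks as 1-block groups
```

The saving comes entirely from parity overhead: a 1-block group spends one parity block per data block, while a 4-block group spends one parity block per four data blocks.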

[0501] The aforementioned optimization routine can therefore beneficially re-allocate occupied logical disk blocks into different parity group configurations to reclaim disk space that might otherwise be lost or rendered inaccessible due to the manner in which the data is stored in the parity groups. As with other parity group manipulation methods provided by the distributed file storage system 100, the process of optimizing parity groups is readily accomplished by rearrangement of the mapping assignments maintained by the G-table 2530 and provides a substantial improvement in performance compared to conventional storage systems. In conventional systems, data restriping is a time consuming and computationally expensive process that reduces data throughput and can render the storage device unavailable while the restriping takes place.

[0502] Like conventional storage systems, the distributed file storage system 100 provides complete functionality for performing routine data and disk optimization routines such as de-fragmentation of logical block assignments and optimization of data placement to improve access times to frequently accessed data. These processes are efficiently handled by the system 100, which can use redundant data access to ensure availability of data while disk optimization routines take place.

[0503] The distributed file storage system 100 further provides adaptive load balancing characteristics that improve the use of resources including servers 130 and disk arrays 140. By balancing the load between available resources, improved data throughput can be achieved where client requests are routed to less busy servers 130 and associated disk arrays 140. Load-dependent routing in this manner reduces congestion due to frequent accessing of a single server or group of servers. Additional details of these features can be found in those discussions relating to adaptive load balancing and proactive control of the DFSS 100.

[0504] In one embodiment, frequently accessed data or files are automatically replicated such that simultaneous requests for the same information can be serviced more efficiently. Frequently accessed data is identified by the servers 130 of the distributed file storage system 100, which maintain statistics on resource usage throughout the network. Furthermore, the servers 130 can use the resource usage statistics in conjunction with predictive algorithms to “learn” content access patterns. Based on these access patterns, frequently accessed content can be automatically moved to server nodes 150 that have high bandwidth capacities capable of serving high numbers of client requests. Additionally, less frequently accessed material can be moved to server nodes 150 that have higher storage capacities or greater available storage space where the data or files can be conveniently stored in areas without significant bandwidth limitations.

[0505] FIG. 38 illustrates one embodiment of a load balancing method 3800 used in conjunction with the distributed file storage system 100 to provide improved read/write performance. In the load balancing method 3800, file operations are performed 3851 and file access statistics are continuously collected 3852 by the servers 130. These statistics include information describing file access frequencies, file size characteristics, and file type characteristics, among other information. Resource utilization statistics are also collected 3854 and contain information that characterizes how data is stored within the distributed file storage system 100. The resource utilization statistics identify how each disk array 140 is used within the system 100 and may contain statistics that reflect the amount of free space within the array, the amount of used space within the array, the frequency of access of a particular disk within the disk array, the speed of servicing client requests, the amount of bandwidth consumed servicing client requests, and other statistics that characterize the function of each disk array 140 within the distributed file storage system 100. The resource utilization statistics can also be used to evaluate the statistics across multiple disk arrays to determine how each disk array compares to other disk arrays within the distributed file storage system 100. This information is useful in identifying bandwidth limitations, bottlenecks, disk array overloads, and disk array underutilization.

[0506] Using either the resource utilization statistics 3854, the file access statistics 3852, or a combination thereof, the one or more servers 130 of the distributed file storage system 100 predict future file and resource utilization characteristics 3856. In one embodiment, the future file and resource utilization characteristics 3856 describe a predicted workload for each of the disk arrays within the distributed file storage system 100. The predicted workload serves as a basis for determining how to best distribute the workload 3858 among available servers and disk arrays to improve access times and reduce bandwidth limitations. Furthermore, the predicted workload can be used to distribute files or content 3860 across the available disk arrays to balance future workloads.

[0507] An additional feature of the distributed file storage system 100 is the ability to perform “hot upgrades” to the disk array 140. This process can involve “hot-swapping” operations where an existing disk within the array is replaced (typically to replace a faulted or non-operational drive). Additionally, the “hot upgrade” process can be performed to add a new disk to the existing array of disks without concomitant disk replacement. The addition of the new disk in this manner increases the storage capacity of the disk array 140 automatically and eliminates the need to restrict access to the disk array 140 during the upgrade process in order to reconfigure the system 100. In one embodiment, the server 130 incorporates the additional space provided by the newly incorporated disk(s) by mapping the disk space into existing unused/available parity groups. For example, when a new drive is added to the disk array 140, the server 130 can extend the length or extent of each available parity group by one. Subsequently, parity group redistribution processes can be invoked to optimize and distribute the newly acquired space in a more efficient manner as determined by the server 130. In one embodiment, when there are more newly added logical disk blocks than can be accommodated by addition to the unused parity groups, at least some of the unused parity groups are split apart by the dissolution process to create enough unused parity groups to incorporate the newly added logical disk blocks.
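
The mapping of a newly added disk into unused parity groups can be sketched as follows. This is a simplified illustration under stated assumptions: a parity group is modelled as a list of hypothetical (disk, block) pairs rather than as linked Gees, each unused group is extended by one block from the new disk, and the function name hot_add_disk is invented for the example.

```python
def hot_add_disk(unused_groups, new_disk_id, new_disk_blocks):
    """Extend each unused parity group by one logical block of a new disk.

    unused_groups is a list of groups, each a list of (disk, block) pairs.
    Returns (extended groups, count of new blocks that found no group).
    """
    extended = []
    next_block = 0
    for group in unused_groups:
        if next_block < new_disk_blocks:
            # The new disk holds no other block of this group, so the rule that
            # no two blocks of a group share a disk is preserved.
            extended.append(group + [(new_disk_id, next_block)])
            next_block += 1
        else:
            extended.append(list(group))
    return extended, new_disk_blocks - next_block


groups = [[(0, 0), (1, 0)], [(0, 1), (1, 1), (2, 1)]]
extended, leftover = hot_add_disk(groups, new_disk_id=3, new_disk_blocks=5)
print(extended, leftover)
```

A nonzero leftover corresponds to the case described above in which there are more newly added logical disk blocks than unused parity groups, so that dissolution is needed to create additional unused groups.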

[0508] Load Balancing

[0509] One approach to adaptive or active load balancing includes two mechanisms. A first mechanism predicts the future server workload, and a second mechanism reallocates resources in response to the predicted workload. Workload prediction can have several aspects. For example, one aspect includes past server workload, such as, for example, file access statistics and controller and network utilization statistics. The loading prediction mechanism can use these statistics (with an appropriate filter applied) to generate predictions for future loading. For example, a straightforward prediction can include recognizing that a file that has experienced heavy sequential read activity in the past few minutes will likely continue to experience heavy sequential read access for the next few minutes.
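
As one illustration of a filter applied to past statistics, the sketch below uses an exponentially weighted moving average, so that recent activity dominates the prediction. The choice of filter, the smoothing factor alpha, and the function name predict_load are assumptions made for this example; the text above does not specify which filter is used.

```python
def predict_load(samples, alpha=0.5):
    """Predict the next load value from past samples (oldest first).

    Uses an exponentially weighted moving average: each new sample contributes
    a fraction alpha, and the running estimate keeps the remainder.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    estimate = float(samples[0])
    for sample in samples[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate


# Reads per minute for a file that recently became busy.
print(predict_load([0, 0, 100, 100], alpha=0.5))
```

A file with sustained heavy read activity in its most recent samples therefore receives a high predicted load, consistent with the straightforward prediction described above.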

[0510] Predictions for future workload can be used to proactively manage resources to optimize loading. Mechanisms that can be used to reallocate server workload include the movement and replication of content (files or objects) between the available storage elements such that controller and storage utilization is balanced, and include the direction of client accesses to available controllers such that controller and network utilization is balanced. In one embodiment, some degree of cooperation from client machines can provide effective load balancing, but client cooperation is not strictly needed.

[0511] Embodiments of the invention include a distributed file server (or servers) comprising a number of hardware resources, including controllers, storage elements such as disks, network elements, and the like. Multiple client machines can be connected through a client network or communication fabric to one or more server clusters, each of which includes one or more controllers and a disk storage pool.

[0512] File system software resident on each controller can collect statistics regarding file accesses and server resource utilization. This includes information on the access frequency, access bandwidth, and access locality for the individual objects stored in the distributed file storage system; the loading of each controller and disk storage element in terms of CPU utilization, data transfer bandwidth, and transactions per second; and the loading of each network element in terms of network latency and data transfer bandwidth.

[0513] The collected statistics can be subjected to various filter operations, which can result in a prediction of future file and resource utilization (i.e., workload). The prediction can also be modified by server configuration data which has been provided in advance, for example, by a system administrator, and explicit indications regarding future file and/or resource usage which may be provided directly from a client machine.

[0514] The predicted workload can then be used to move content (files, objects, or the like) between storage elements and to direct client accesses to controllers in such a manner that the overall workload is distributed as evenly as possible, resulting in the best overall load balance across the distributed file storage system and the best system performance.

[0515] The predicted workload can be employed to perform client network load balancing, intra-cluster storage load balancing, inter-node storage load balancing, intra-node storage capacity balancing, inter-node storage capacity balancing, file replication load balancing, or the like.

[0516] Client network load balancing includes managing client requests to the extent possible such that the client load presented to the several controllers comprising a server cluster, and the load presented to the several client network ports within each, are evenly balanced. Intra-cluster storage load balancing includes the movement of data between the disks connected to a controller cluster such that the disk bandwidth loading among each of the drives in an array, and the network bandwidth among the networks connecting disk arrays to controllers, are balanced. For example, intra-cluster storage load balancing can be accomplished by moving relatively infrequently accessed files or objects. Intra-cluster storage load balancing advantageously achieves uniform bandwidth load for each storage sub-network, while also achieving uniform bandwidth loading for each individual disk drive.

[0517] Inter-node storage load balancing comprises the movement of data between drives connected to different controller clusters to equalize disk access load between controllers. This can often cost more than intra-node drive load balancing, as file data is actually copied between controllers over the client network. Intra-node storage capacity balancing comprises movement of data between the disks connected to a controller (or controller pair) to balance disk storage utilization among each of the drives.

[0518] Inter-node storage capacity balancing comprises movement of data between drives connected to different controllers to equalize overall disk storage utilization among the different controllers. This can often cost more than intra-node drive capacity balancing, as file data is actually copied between controllers over the network. File replication load balancing comprises load balancing through file replication as an extension of inter-node drive load balancing. For example, high usage files are replicated so that multiple controller clusters include one or more local (read-only) copies. This allows the workload associated with these heavily accessed files to be distributed across a larger set of disks and controllers.

[0519] Based on the foregoing, embodiments of the present invention include a distributed file storage system that proactively positions objects to balance resource loading across the same. As used herein, load balancing can include, among other things, capacity balancing, throughput balancing, or both. Capacity balancing seeks balance in storage, such as the number of objects, the number of Megabytes, or the like, stored on particular resources within the distributed file storage system. Throughput balancing seeks balance in the number of transactions processed, such as the number of transactions per second, the number of Megabytes per second, or the like, handled by particular resources within the distributed file storage system. According to one embodiment, the distributed file storage system can position objects to balance capacity, throughput, or both, between objects on a resource, between resources, between the servers of a cluster of resources, between the servers of other clusters of resources, or the like.

[0520] The distributed file storage system can proactively position objects for initial load balancing, for example, to determine where to place a particular new object. While existing server loading is a factor used in the determination, other data can be used to help predict the access frequency of the new object, such as, for example, file extensions, DV access attributes, or the like. For example, a file extension indicating a streaming media file can be used to predict a likely sequential access to the same.

[0521] The distributed file storage system actively continues load balancing for the existing objects throughout the system using load balancing data. For capacity load balancing, large objects predicted to be infrequently accessed can be moved to servers that, for example, have the lower total percent capacity utilizations. Movement of such files advantageously avoids disrupting throughput balancing by moving predominantly infrequently accessed files. For throughput balancing, objects predicted to be frequently accessed can be moved to servers that, for example, have the lower total percent transaction utilizations. In one embodiment, smaller objects predicted to be frequently accessed can be moved in favor of larger objects predicted to be frequently accessed, thereby advantageously avoiding the disruption of capacity balancing.
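
The selection rules in this paragraph can be sketched as follows. This is an illustrative Python example only: the `Obj` record, the access-rate thresholds, and the function names are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    # Hypothetical per-object record; field names are assumptions.
    name: str
    size_mb: int
    predicted_accesses_per_hour: float


def pick_for_capacity(objects, access_threshold=1.0):
    """Candidates for capacity balancing: rarely accessed objects, largest first."""
    cold = [o for o in objects if o.predicted_accesses_per_hour <= access_threshold]
    return sorted(cold, key=lambda o: o.size_mb, reverse=True)


def pick_for_throughput(objects, access_threshold=100.0):
    """Candidates for throughput balancing: frequently accessed objects, smallest first."""
    hot = [o for o in objects if o.predicted_accesses_per_hour >= access_threshold]
    return sorted(hot, key=lambda o: o.size_mb)


def least_utilized(utilization):
    """Return the server with the lowest percent utilization."""
    return min(utilization, key=utilization.get)


objects = [
    Obj("a", 5000, 0.1),
    Obj("b", 10, 500.0),
    Obj("c", 800, 0.5),
    Obj("d", 2000, 300.0),
]
capacity_candidates = pick_for_capacity(objects)
throughput_candidates = pick_for_throughput(objects)
target = least_utilized({"F1": 0.9, "F2": 0.4, "F3": 0.6})
```

Moving large cold objects changes capacity without noticeably changing throughput, and moving small hot objects changes throughput without noticeably changing capacity, so the two balancing goals interfere with each other as little as possible.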

[0522] According to one embodiment, one or more filters may be applied during initial and/or active load balancing to ensure one or a small set of objects are not frequently transferred, or churned, throughout the resources of the system.

[0523] The distributed file storage system can comprise resources, such as a server or servers, which can seek to balance the loading across the system by reviewing a collection of load balancing data from itself, one or more of the other servers in the system, or the like. The load balancing data can include object file statistics, server profiles, predicted file accesses, historical statistics, object patterns, or the like. A proactive object positioner associated with a particular server can use the load balancing data to generate an object positioning plan designed to move objects, replicate objects, or both, across other resources in the system. Then, using the object positioning plan, the resource or other resources within the distributed file storage system can execute the plan in an efficient manner.

[0524] According to one embodiment, the generation of the positioning plan can be very straightforward, such as, for example, based on object sizes and historical file access frequencies. Alternatively, the generation of the plan can be quite complex, based on a large variety of load balancing information applied to predictive filtering algorithms, the output of which is a generally more accurate estimate of future file accesses and resource usage, which results in more effective object positioning. Another embodiment can include adaptive algorithms which track the accuracy of their predictions, using the feedback to tune the algorithms to more accurately predict future object access frequencies, thereby generating effective object positioning plans.

[0525] According to one embodiment, each server pushes objects defined by that server's respective portion of the object positioning plan to the other servers in the distributed file storage system. By employing the servers to individually push objects based on the results of their object positioning plan, the distributed file storage system provides a server-, process-, and administrator-independent automated approach to object positioning, and thus load balancing, within the distributed file storage system.

[0526] To facilitate a complete understanding of exemplary load balancing aspects of the invention, this part of the detailed description describes the invention with reference to FIGS. 39-41, wherein like elements are referenced with like numerals throughout.

[0527] FIG. 39 depicts an exemplary embodiment of servers and disk arrays of a distributed file storage system (DFSS) 3900, disclosed for the purpose of highlighting the distributed proactive object positioning aspects of an exemplary embodiment of the invention. A skilled artisan will recognize FIG. 39 is not intended to limit the large number of potential configurations of servers and disk arrays encompassed by the foregoing distributed file storage system 100 disclosed with reference to FIG. 1. As shown in FIG. 39, the DFSS 3900 comprises five nodes formed into three clusters 3905, 3910, and 3915. Cluster 3905 includes a first node comprising server F1 and a disk array 3920, and a second node comprising server F2 and a disk array 3922. Cluster 3910 includes one node comprising server F3 and a disk array 3924. Additionally, cluster 3915 includes a first node comprising server F4 and a disk array 3926, and a second node comprising server F5 and a disk array 3928.

[0528] According to one embodiment, each of the servers F1, F2, F3, F4, and F5 comprises software, hardware, and communications similar to the servers 130-135 disclosed with reference to FIGS. 1 and 2. For example, server F1 communicates with each drive of the disk array 3920. Additionally, server F1 forms part of cluster 3905. According to one embodiment, at least some of the objects stored on a disk array within a cluster are stored, and are thereby accessible, on other disk arrays within the cluster. For example, server F1 can be configured to communicate with each drive of the disk array 3922. Server F1 also communicates with one or more of the other servers of the DFSS 3900. Moreover, the servers F1, F2, F3, F4, and F5 include software and hardware systems which employ some or all of the features of the distributed file storage system 100, such as, for example, the disclosed use of metadata structures for object organization, metadata and data caching, and the like.

[0529] FIG. 39 also shows exemplary self-explanatory attributes of each of the drives of the disk arrays 3920-3928. For example, the drives of the disk array 3920 include two high speed drives having small storage capacity, e.g., “FAST, SMALL,” one drive having high speed and average storage capacity, e.g., “FAST, AVERAGE,” and one drive having average speed and large storage capacity, e.g., “AVERAGE, LARGE.” Additionally, FIG. 39 shows servers F3 and F4 providing access to a resource, such as, for example, a printer, scanner, display, memory, or the like. A skilled artisan will recognize from the disclosure herein that the speed of a drive includes its ordinary meaning as well as a measure of the data rate, or the like, of read or write operations.

[0530] According to one embodiment, the DFSS 3900 includes proactive object positioning. For example, each server F1-F5 of the DFSS 3900 proactively positions objects, such as files, directories, or the like, based on a desire to balance or optimize throughput, capacity, or both. According to one embodiment, the foregoing balancing and optimization can advantageously occur at multiple levels within the DFSS 3900. For example, the DFSS 3900 can advantageously seek to optimize the placement and structure of objects within and between disks of the disk arrays, between the servers of a cluster, and between servers of other clusters.

[0531] Load Balancing Within and Between the Drives of the Disk Arrays

[0532] Similar to the embodiments disclosed with reference to FIGS. 1 and 5, the DFSS 3900 provides the server F1 with the ability to adjust the file logical block size and the distribution of files across multiple drives using, for example, the Gee Table 320. Thus, the server F1 can adjust or choose the layout of particular files within a disk, using, for example, larger file logical block sizes for larger files, or the like, thereby creating efficient storage of the same. Moreover, the server F1 can adjust or choose the layout of particular files across varying numbers of disks, thereby matching, for example, performance of drives within the disk array 3920 with attributes of particular files.

[0533] For example, FIG. 39 shows the placement of two files within the DFSS 3900, e.g., streamed file “SF” and large file “LF.” According to the exemplary embodiment, file “SF” comprises a file which is to be streamed across computer networks, such as, for example, the Internet. As shown in FIG. 39, file SF is stored in the disk array 3920 using a distributed parity group of three blocks, e.g., two data blocks, “SF₁,” and “SF₂,” and one parity block “SF₃.” Similar to the foregoing description of distributed file storage system 100, the DFSS 3900 advantageously allows files to modify the number of drives in the distributed parity group for a variety of reasons, including to take advantage of attributes of a disk array. Thus, when it is determined that it is desirable to store file SF on only fast disk drives, the distributed parity group can be chosen such that file SF is stored on the fastest drives of disk array 3920 in equally shared portions. A skilled artisan will recognize from the disclosure herein that the servers advantageously balance the desire to employ the faster drives of a particular disk array, against the desire to reduce the overhead associated with using smaller parity groups. For example, according to some embodiments, use of only two disks of five disks means that half of the data stored is overhead parity data.
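
The overhead tradeoff can be made concrete with a short calculation. The Python sketch below assumes one parity block per distributed parity group, as in the SF example (two data blocks and one parity block); the function name `parity_overhead` is illustrative.

```python
from fractions import Fraction


def parity_overhead(group_size):
    """Fraction of stored blocks that are parity, assuming one parity block per group."""
    if group_size < 2:
        raise ValueError("a parity group needs at least one data block and one parity block")
    return Fraction(1, group_size)


two_drive = parity_overhead(2)    # one data block + one parity block
three_drive = parity_overhead(3)  # two data blocks + one parity block, as for file SF
five_drive = parity_overhead(5)   # four data blocks + one parity block
```

Under this assumption, a group spanning two drives spends half of its stored blocks on parity, a group of three spends one third, and a group spanning all five drives spends one fifth, which is the overhead the servers weigh against the benefit of using only the fastest drives.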

[0534] FIG. 39 also shows that in the disk array 3922, file SF′, a copy of file SF, can be stored according to the attributes of the disk array 3922, e.g., file SF′ is stored using a distributed parity group of two because the disk array 3922 has only two fast drives. Moreover, FIG. 39 shows file LF stored in the disk array 3924. According to the exemplary embodiment, file LF is stored using distributed parity groups of three blocks, thereby fully taking advantage of all three very fast drives.

[0535] Thus, the server F1 advantageously and proactively can adjust the placement and structure of objects, such as files, within and between drives of the disk array 3920. A skilled artisan will recognize that such proactive placement is outside the ability of conventional data storage systems. For example, as disclosed with reference to FIGS. 14-16, the DFSS 3900 advantageously includes a directory and file handle lookup process which allows the clients 110 to find files without first knowing which server is currently storing the file. Thus, when one of the servers of the DFSS 3900 repositions an object to balance load, capacity, or the like, the clients 110 can use the lookup process to find the repositioned object in its new location.

[0536] Load Balancing Between Servers of a Cluster

[0537] As disclosed in the foregoing, one embodiment of the DFSS 3900 seeks to balance the loading and capacity between servers of a cluster. As disclosed with reference to the embodiments of FIGS. 1 and 13-14, the clients 110 request data from a file through the use of the file handle 1300, which according to one embodiment, includes the server identification 1320. Thus, the DFSS 3900 can advantageously alter the server identification 1320 of the file handle 1300 for a particular file, thereby changing the read or write request from being processed by, for example, server F1 to, for example, server F2. A skilled artisan will recognize a wide number of reasons for making the foregoing alteration of the file handle 1300, including the availability of F1, the load of F1 versus F2, or the like. In addition, the DFSS 3900 can alter the file handle 1300 based on comparisons of server load balancing data, to set up read-only copies of heavily accessed files, or the like, as discussed below.
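
A minimal Python sketch of this redirection follows. The `FileHandle` class is a simplified stand-in for the file handle 1300 (its `server_id` field plays the role of the server identification 1320), and the `redirect` function, the load figures, and the selection rule are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FileHandle:
    # Simplified, hypothetical stand-in for file handle 1300.
    file_id: int
    server_id: str  # plays the role of server identification 1320


def redirect(handle, loads, available):
    """Return a handle naming the least loaded available server.

    The original handle is returned unchanged when its server is
    available and no other available server is less loaded.
    """
    candidates = [s for s in loads if s in available]
    if not candidates:
        return handle
    target = min(candidates, key=loads.get)
    if handle.server_id in candidates and loads[handle.server_id] <= loads[target]:
        return handle
    return replace(handle, server_id=target)


handle = FileHandle(file_id=42, server_id="F1")
busy = redirect(handle, {"F1": 0.9, "F2": 0.3}, {"F1", "F2"})
idle = redirect(handle, {"F1": 0.2, "F2": 0.3}, {"F1", "F2"})
down = redirect(handle, {"F1": 0.2, "F2": 0.3}, {"F2"})
```

Because clients address requests using the server named in the handle, changing only that field is enough to steer subsequent reads or writes to another server of the cluster.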

[0538] Load Balancing Between Servers of Other Clusters

[0539] Load balancing between servers differs from load balancing between drives in that, among other things, load balancing between servers involves balancing through the movement or creation of additional copies of objects, while load balancing between drives involves the movement of data blocks.

[0540] One embodiment of the DFSS 3900 comprises servers F1-F5 each having access to load balancing data from itself and each of the other servers. According to one embodiment, each server uses the load balancing data to generate an object positioning plan, and then pushes objects defined by its respective portion of the plan to other servers in the DFSS 3900. The foregoing implementation provides a distributed and server-independent approach to object positioning within the DFSS 3900. It will be understood by a skilled artisan from the disclosure herein that resources, or groups of resources, can gather load balancing data, such as, for example, each, some, or all clusters, each, some, or all servers, or the like.

[0541] According to one embodiment, the load balancing data of a particular server can include a wide variety of statistical and attribute data relating to the architecture and performance of the respective server and disk array. Additional statistical information can be maintained relating to the historical object access frequencies and patterns. This statistical information can be applied to a filtering function to predict future object frequencies and patterns.

[0542] The load balancing data can include relatively static information, such as, for example, the number of servers for a given cluster and the number of drives connected to each server. Moreover, for each server, the load balancing data can include an indication of the number and type of interfaces available to the server, performance statistics of the server, amount of available memory, an indication of the health of the server, or the like. For each drive, the load balancing data can include an indication of the layout of the drive, such as track information, cylinder information, or the like, capacity and performance information, performance statistics, an indication of the health of the drive, or the like. Additionally, the load balancing data can include an indication of the performance and the health of storage network configurations, client network configurations, or the like. The relatively static load balancing data can be considered the “profile” of the resources associated therewith.

[0543] Other relatively static information can include an indication of the quality of service being demanded by the clients 110 from a particular server. For example, server F1 and its associated disk array 3920 can be configured to provide data availability with little or no downtime, thereby allowing the server to support Internet hosting applications or the like. Additionally, the foregoing relatively static statistical or attribute information can change occasionally, such as, for example, when a drive is replaced or added, a server is reconfigured, the quality of service is changed, or the like.

[0544] According to yet another embodiment, the load balancing data can also include relatively dynamic information, such as, for example, throughput information like the number of read or write input/output operations per second (IOPS). For example, the dynamic information can include server throughput for each server, such as, for example, client transactions per second, client megabytes per second, disk transactions per second, disk megabytes per second, or the like. The foregoing server throughput information can include read, write, or both operations for each client interface of the particular server. The server throughput data also includes dynamic information such as the cache hit ratio, errors, or the like, of each particular server. The dynamic information can also include disk throughput for each disk, such as, for example, an indication of the amount of metadata capacity that is being utilized, the amount of data capacity utilized, read, write, or both transactions per second, read, write, or both megabytes per second, errors, or the like.

[0545] In addition to the foregoing data, the load balancing data includes object statistic information, such as, for example, the last access time and the access frequency for each object. According to one embodiment, the measurement of access frequency can be filtered using one or more filtering weights designed to emphasize, for example, more recent data over more historical data.
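
One way to weight recent accesses more heavily than older ones is to let each access's contribution decay with its age. The Python sketch below is an assumption for illustration: the half-life weighting, the function name `weighted_frequency`, and the timestamps are not specified in the description.

```python
def weighted_frequency(access_times, now, half_life=600.0):
    """Sum of per-access weights that halve every `half_life` seconds of age.

    An access at time `now` counts 1.0; one that is `half_life` seconds
    old counts 0.5; accesses stamped in the future are ignored.
    """
    return sum(0.5 ** ((now - t) / half_life) for t in access_times if t <= now)


# Hypothetical access timestamps in seconds.
score = weighted_frequency([0.0, 600.0], now=600.0)
```

Two objects with the same raw access count then receive different scores when one was accessed recently and the other long ago, which is the emphasis on recent data that the filtering weights are intended to provide.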

[0546] According to one embodiment, each server may include file statistical information in the load balancing data, comprising additional information for the more heavily accessed, and potentially smaller, objects. For example, the file statistical information can include an indication of access frequency for, for example, the last ten (10) minutes, one (1) hour, twenty-four (24) hours, or the like. Moreover, the file statistical information can include average read block size, average write block size, access locality, such as an indication of randomness or sequentialness for a given file, histogram data of accesses versus day and time, or the like. According to one embodiment, the indication of randomness can include a randomness rating, such as, for example, a range from 0 to 1, where 0 corresponds to a primarily randomly accessed file and 1 corresponds to a primarily sequentially accessed file, or vice versa.
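
A rating of this kind could be derived from the observed access pattern. The Python sketch below counts how often an access begins exactly where the previous one ended; this particular measure, the function name `sequentialness`, and the (offset, length) representation are assumptions. It uses the convention in which 0 means primarily random and 1 means primarily sequential.

```python
def sequentialness(accesses):
    """Rate an ordered list of (offset, length) accesses from 0.0 to 1.0.

    Returns the fraction of accesses that start where the previous
    access ended, or None when there are fewer than two accesses.
    """
    if len(accesses) < 2:
        return None
    sequential = sum(
        1
        for (prev_offset, prev_length), (offset, _) in zip(accesses, accesses[1:])
        if offset == prev_offset + prev_length
    )
    return sequential / (len(accesses) - 1)


streamed = sequentialness([(0, 4), (4, 4), (8, 4)])
scattered = sequentialness([(0, 4), (100, 4), (4, 4)])
mixed = sequentialness([(0, 4), (4, 4), (50, 4)])
```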

[0547] Based on the above, the load balancing data for a given server can include virtually any information, performance or attribute statistic, or the like that provides insight into how objects, such as files and directories, should be created, reconfigured, moved, or the like, within the DFSS 3900. For example, a skilled artisan can include additional information useful in the prediction of file access frequencies, such as, for example, the time of day, the file size, the file extension, or the like. Moreover, the additional information can include hints corresponding to dynamic volume access attributes, such as, for example, block size, read/write information, the foregoing quality of service guarantees, or the randomness/sequentialness of file access.

[0548] According to one embodiment, the load balancing data can include a Least Recently Used (LRU) stack and/or a Most Recently Used (MRU) stack, advantageously providing insight into which objects can be used for balancing capacity, throughput, or both, within the DFSS 3900. For example, according to one embodiment, the LRU stack tracks the objects that are rarely or infrequently accessed, thereby providing information to the servers about which objects can be mostly ignored for purposes of throughput balancing, and are likely candidates for capacity balancing. The MRU stack tracks the objects that are more frequently accessed, thereby providing information to the servers about which objects are highly relevant for throughput balancing. According to one embodiment, the servers F1-F5 can employ the MRU stack to determine the objects on which the servers should be tracking additional performance statistics used in more sophisticated load balancing or sharing solutions, as discussed in the foregoing.
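
The following Python sketch tracks recency of access in a single ordered structure from which both an LRU view and an MRU view can be read. The class name `RecencyTracker` and its methods are hypothetical, and recency of access is used here only as a simple proxy for the access frequency discussed above.

```python
from collections import OrderedDict


class RecencyTracker:
    """Keeps object identifiers ordered from least to most recently accessed."""

    def __init__(self):
        self._order = OrderedDict()

    def touch(self, object_id):
        """Record an access, making object_id the most recently used entry."""
        self._order[object_id] = True
        self._order.move_to_end(object_id)

    def least_recent(self, n):
        """LRU view: likely candidates for capacity balancing."""
        return list(self._order)[:n]

    def most_recent(self, n):
        """MRU view: objects most relevant to throughput balancing."""
        return list(reversed(self._order))[:n]


tracker = RecencyTracker()
for object_id in ["a", "b", "c", "a"]:
    tracker.touch(object_id)
cold = tracker.least_recent(2)
hot = tracker.most_recent(2)
```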

[0549] A skilled artisan will recognize from the disclosure herein that the MRU and LRU stacks can be combined into a single stack or other structure tracking the frequency of access for some or all of the objects of the servers F1-F5. A skilled artisan will also recognize from the disclosure herein that the time frame chosen for determining frequency of use for a given object affects the throughput and capacity balancing operations. For example, if the time frame is every twelve hours, the number of objects considered to be frequently accessed may be increased as compared to a time frame of every half-second. According to one embodiment, the DFSS 3900 uses an adaptive time frame of ten (10) minutes to twenty-four (24) hours.

[0550] Although the load balancing data is disclosed with reference to its preferred embodiment, the invention is not intended to be limited thereby. Rather, a skilled artisan will recognize from the disclosure herein a wide number of alternatives for the same. For example, the load balancing data can include detailed performance statistics similar to those disclosed above. On the other hand, the load balancing data can include only a few data points providing only a rough sketch of the throughput and capacity on a particular server. Moreover, the server may track access frequency using information contained in the G-Node of an object, such as, for example, the last access time, or “atime,” field.

[0551] FIG. 40 illustrates a block diagram of an exemplary server 4000, such as the servers F1-F5 of FIG. 39, according to aspects of an exemplary embodiment of the invention. As shown in FIG. 40, the server 4000 includes a server interface 4005, a server software or file system 4010, load balancing data 4020, and an object positioning plan 4025. The server interface 4005 passes data access requests from, for example, the clients 110, to the file system 4010. The server interface 4005 includes a server manager 4008, which collects client access statistics, such as transactions per second per client, per port, and per server, and megabytes per second per client, per port, and per server. The server system 4010 includes several layers that participate in statistics collection. For example, the server system 4010 includes a request processing layer 4012, a data/metadata management layer 4014, and a storage management layer 4016. The request processing layer 4012 collects the statistics related to accesses to specific files. The data/metadata management layer 4014 collects drive resource and capacity utilization information. The storage management layer 4016 collects statistics related to transactions per second and megabytes per second for each storage network interface and drive.

[0552] FIG. 40 also shows that each server 4000, such as the servers F1-F5 of FIG. 39, includes a proactive object positioner 4018, according to aspects of an exemplary embodiment of the invention. According to one embodiment, the positioner 4018 comprises a set of rules, a software engine, or the like, applying logic algorithms to some or all of the load balancing data 4020 to generate the object positioning plan 4025.

[0553] As disclosed in the foregoing, the servers F1, F2, F3, F4, and F5 each share their respective load balancing data with one another. Thus, the load balancing data 4020 comprises load balancing data from the particular server, in this example, server F3, and the load balancing data from each of the other servers, F1-F2 and F4-F5. According to one embodiment, a server transmits its load balancing data at predetermined time intervals. According to another embodiment, each server determines when a significant change has occurred or a time limit has expired since the last broadcast of its load balancing data, and then broadcasts the same.
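
The second broadcast policy can be sketched as a simple decision function. In the Python example below, the function name `should_broadcast`, the 10% relative-change threshold, the 300-second time limit, and the metric names are all illustrative assumptions; the description does not define what counts as a significant change.

```python
def should_broadcast(last_sent, current, now, last_time,
                     change_threshold=0.10, max_interval=300.0):
    """Decide whether a server should rebroadcast its load balancing data.

    Returns True when the time limit since the last broadcast has expired,
    when a metric appears that was not in the last broadcast, or when any
    metric has changed by at least `change_threshold` relative to the value
    last sent.
    """
    if now - last_time >= max_interval:
        return True
    for key, new_value in current.items():
        old_value = last_sent.get(key)
        if old_value is None:
            return True
        base = max(abs(old_value), 1e-9)  # avoid dividing by zero
        if abs(new_value - old_value) / base >= change_threshold:
            return True
    return False


small_change = should_broadcast({"iops": 1000}, {"iops": 1050}, now=10.0, last_time=0.0)
large_change = should_broadcast({"iops": 1000}, {"iops": 1200}, now=10.0, last_time=0.0)
timed_out = should_broadcast({"iops": 1000}, {"iops": 1000}, now=400.0, last_time=0.0)
```

Compared with broadcasting at fixed intervals, a policy of this kind reduces traffic while loads are stable and still bounds how stale any server's view of its peers can become.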

[0554] As shown in FIG. 40, each server 4000 includes the proactive object positioner 4018, which accepts as an input the load balancing data of some or all of the servers, and generates as an output the object positioning plan 4025. According to one embodiment, the proactive object positioner 4018 for a given server generates a plan for that server. The server then attempts to push objects found in the plan to the other servers in the DFSS 3900 to balance throughput, capacity, or both. According to another embodiment, the proactive object positioner 4018 for a given server generates the plan 4025, which is relevant to all servers. In such a case, the server attempts to push only its objects from the plan 4025 to other servers. Thus, each server in the DFSS 3900 acts independently to accomplish the plan 4025 of the entire DFSS 3900, thereby advantageously providing a distributed and balanced approach that has no single point of failure and needs only minimal supervision, if any.
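
For the embodiment in which the plan is relevant to all servers, each server's portion can be obtained by filtering the plan on the source server. The Python sketch below assumes a hypothetical plan representation (a list of `Move` entries); the entry fields, the function name `portion_for`, and the sample plan are not taken from the disclosure.

```python
from collections import namedtuple

# Hypothetical plan entry: push object_id from server `source` to server `target`.
Move = namedtuple("Move", ["object_id", "source", "target"])


def portion_for(plan, server_id):
    """Return the entries of the plan that server_id is responsible for pushing."""
    return [move for move in plan if move.source == server_id]


plan = [
    Move("LF", "F3", "F4"),
    Move("SF-copy", "F1", "F3"),
    Move("logs", "F1", "F2"),
]
f1_moves = portion_for(plan, "F1")
```

Because every entry names exactly one source server, the portions handled by the individual servers together cover the whole plan without any coordinating node.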

[0555] As discussed in the foregoing, the object positioner 4018 corresponding to each server in the DFSS 3900 can generate the positioning plan 4025 to position objects to balance capacity, throughput, or both.

[0556] Positioning to Balance Capacity, Such as the Number or Size of Objects

[0557] According to one embodiment, the proactive object positioner 4018 for each server can instruct its server to balance the number of objects stored on some or each disk array of the DFSS 3900. For example, as disclosed with reference to FIG. 5, each server has a predefined amount of memory for caching the G-nodes of the objects stored on the disk array associated with that server. By balancing the number of objects related to a particular server, the DFSS 3900 advantageously avoids having more G-node data for a server than can be stored in that server's G-node memory cache.

[0558] According to one embodiment, the proactive object positioner 4018 for each server can instruct its server to balance the size of objects stored on some or each disk array of the DFSS 3900. For example, if a particular server is associated with a disk array having a large number of small objects stored therein, the server can exceed that server's G-node memory cache. Therefore, each proactive object positioner 4018 can instruct its server to push objects such that the size of objects accessible by each server is balanced. For example, the servers can evenly distribute the number of small objects, the number of medium-sized objects, and the number of large objects between servers. By balancing the size of objects related to a particular server, the DFSS 3900 reduces the chances of having more G-node data for a server than can be stored in that server's G-node memory cache.

[0559] According to yet another embodiment, the proactive object positioner 4018 for each server can instruct its server to optimize the number of free and used data blocks when the servers in the DFSS 3900 have a large average object size. In such a case, the number of G-nodes and the G-node memory cache will not likely be a performance issue, although the number of used versus free data blocks will likely be an issue. While used versus free data blocks need not be entirely uniform across servers, maintaining a certain level of unused block capacity for each server provides flexibility in throughput balancing and new object creation, thereby enhancing the performance of the overall DFSS 3900.

[0560] Positioning to Balance Throughput, Such as the Access Frequency of Objects

[0561] According to one embodiment, the proactive object positioner 4018 generates the positioning plan 4025 to position objects based on, for example, predicted access frequencies of the same. As discussed above, prediction may comprise historical data, and may comprise a number of other data and factors as well. The positioner 4018 can advantageously use objects predicted to be infrequently accessed for capacity balancing to avoid upsetting any throughput balancing already in place. For example, when the positioner 4018 determines to balance the capacity among resources of the DFSS 3900, such as, for example, a drive, disk array, or server, the positioner 4018 can move objects that are of little significance to the throughput of the resource, such as, for example, those objects predicted to be least accessed. Thus, as the positioner 4018 balances the capacity through objects predicted to be, or found to be, least recently accessed, the respective throughput of the resources will not be substantially affected. According to one embodiment, each server tracks the objects predicted to be infrequently used by maintaining in its load balancing data an LRU stack of, for example, pointers to the G-Nodes of the objects predicted to be infrequently accessed.

[0562] Additionally, the positioner 4018 can generate the positioning plan 4025 to move objects predicted to be infrequently accessed from faster drives to slower drives. For example, if the large file LF from FIG. 39 were predicted to be infrequently accessed, storage of file LF on the fastest drives of the DFSS 3900, for example, the drives of the disk array 3924, would be inefficient. Thus, the proactive object positioner 4018 determines that the large file LF predicted to be infrequently accessed can be advantageously stored on the slow, large drives of the disk array 3926 of server F4. A skilled artisan will recognize that movement of the file LF to server F4 is not expected to substantially affect the throughput of servers F3 and F4, outside of the processes for moving the file LF.

[0563] Additionally, the proactive object positioner 4018 can use the MRU stack in a server's load balancing data to instruct an overburdened server to take actions to offload some of the access from itself to those servers with less throughput. For example, the positioner 4018 can generate instructions to move the objects predicted to be heavily accessed to other servers, thereby moving the entire throughput load associated therewith to the other servers. Also, the positioner 4018 can generate instructions to create copies of objects predicted to be heavily accessed on other servers, thereby sharing the throughput load with the other servers.

[0564] For example, according to one embodiment, the server F1 includes the streamed file SF predicted to be heavily accessed, which in this example may include extremely popular multimedia data, such as, for example, a new video or music release, a major news story, or the like, where many clients are requesting access of the same. Moreover, according to this embodiment, the server F1 is being over-utilized, while the server F3 is being under-utilized. Thus, the object positioner 4018 recognizes that the movement of the file SF to the server F3 may simply overload the server F3. According to one embodiment, the proactive object positioner 4018 can instruct the server F1 to push, for example, read-only copies of the file SF to the server F3. Moreover, a skilled artisan will recognize from the disclosure herein that the server F1 can then return to a requesting client a file handle 1300 for the file SF designating server F3, and the client will then generate requests to server F3, instead of server F1. Accordingly, the over-utilization of server F1 and the under-utilization of server F3 are both advantageously reduced, thereby balancing the throughput across the DFSS 3900.

[0565] According to yet another embodiment, the proactive object positioner 4018 can generate instructions to move objects to match the attributes of resources available to a particular server, thereby potentially decreasing the response time of the DFSS 3900. For example, as illustrated in the foregoing embodiment, the object positioner 4018 can instruct the server F1 to push the file SF predicted to be heavily accessed to the server F3 having very fast disk drives, even when the server F1 is not being over-utilized. Moreover, as discussed above, the positioner 4018 can instruct the server F3 to store the file in distributed parity groups matching the number of very fast drives.

[0566] According to one embodiment, one or more of the servers can include specific software and hardware solutions, such as dedicated digital signal processors, which can add additional horsepower to the generation of the object positioning plan 4025. For example, load balancing can be performed by an external client connected to the DFSS 3900.

[0567] FIG. 41 depicts the object positioning plan 4025 of server F3 of FIG. 39, according to aspects of an exemplary embodiment of the invention. As shown in FIG. 41, the plan 4025 includes instructions to push an object, and instructions on how to handle subsequent client requests for access to that object. According to one embodiment, a server that pushes an object tells clients seeking access to the object that the object has been moved. The pushing server can maintain a cache of objects that it recently pushed, and when feasible, the pushing server will supply the requesting client with the location, or server, where the object was moved, thereby providing direct access to the object for the client.

[0568] As shown in FIG. 41, the plan 4025 calls for server F3 to push the large file LF to server F4 for storage thereon, thereby freeing the fastest drives in the DFSS 3900 to store more objects predicted to be more heavily accessed. Moreover, the plan 4025 includes an indication that server F3 will return an indication of staleness for any clients still caching the file handle of file LF designating server F3. The plan 4025 also indicates that if server F1 requests, server F3 will accept and store a copy of the streamed file SF and return an indication of file creation to server F1, such as, for example, the file handle of server F3's copy of file SF. Thus, the DFSS 3900 uses a pushing approach to ensure server independence in proactively placing objects.
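The push-and-staleness behavior of the plan 4025 can be illustrated with a small model. This is a sketch under stated assumptions only: the `Server` class, its methods, and the tuple-shaped replies are hypothetical and are not taken from the disclosure.

```python
class Server:
    """Hypothetical model of a server that pushes objects to its peers."""

    def __init__(self, name):
        self.name = name
        self.objects = {}          # object id -> contents
        self.recently_pushed = {}  # object id -> name of destination server

    def push(self, obj_id, dest):
        """Move an object to dest and remember where it went."""
        dest.objects[obj_id] = self.objects.pop(obj_id)
        self.recently_pushed[obj_id] = dest.name

    def push_copy(self, obj_id, dest):
        """Place a copy on dest while keeping the local object."""
        dest.objects[obj_id] = self.objects[obj_id]

    def access(self, obj_id):
        """Return ("ok", data), or ("stale", hint) when the object is gone."""
        if obj_id in self.objects:
            return ("ok", self.objects[obj_id])
        return ("stale", self.recently_pushed.get(obj_id))


f1, f3, f4 = Server("F1"), Server("F3"), Server("F4")
f3.objects["LF"] = b"large file"
f1.objects["SF"] = b"streamed file"
f3.push("LF", f4)       # LF leaves the fast drives of F3
f1.push_copy("SF", f3)  # F3 accepts a copy of SF
```

A client that still holds a handle designating F3 for file LF receives the staleness reply, optionally with a hint naming F4, and can then redirect its request.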

[0569] Based on the foregoing disclosure related to FIGS. 39-41, a skilled artisan will recognize the vast scalability of the DFSS 3900. For example, adding or removing hardware components such as drives, resources, or even servers, simply causes updated, or sometimes additional, load balancing information to be broadcast to the other servers. Each server then can immediately generate new positioning plans to take full advantage of the new components or configuration of the DFSS 3900. Each server then pushes its respective objects throughout the DFSS 3900, thereby efficiently balancing the throughput, capacity, or both, of the same.

[0570] Although the foregoing invention has been described in terms of certain preferred embodiments, other embodiments will be apparent to those of ordinary skill in the art from the disclosure herein. For example, the DFSS 3900 may advantageously push new file handles to clients, such as, for example, file handles including information on the location of an object. According to another embodiment, the DFSS 3900 can advantageously allow servers that have pushed objects to other servers to automatically suggest new file handles to requesting clients. However, this approach can have the drawback that the file handle stored by the old server can itself be outdated, for example, when the new server subsequently pushed the same object to yet another server. Thus, according to one embodiment, servers return indications of staleness for objects they no longer have stored on their respective disk arrays.

[0571] In addition, a skilled artisan will recognize from the disclosure herein that many of the balancing ideas can be implemented in conventional non-distributed file storage systems. For example, the method of moving infrequently accessed files to balance capacity so as not to upset balanced load can be incorporated into conventional data storage systems.

[0572] Data Flow Architecture

[0573] Each server 130-135 in the DFSS 100 includes storage controller hardware and storage controller software to manage an array of disk drives. For example, the servers 130-131 each manage data on the disk arrays 140 and 141. A large number of disk drives can be used, and the DFSS 100 can be accessed by a large number of client machines 110. This potentially places a large workload on the servers 130-135. It is therefore desirable that the servers 130-135 operate in an efficient manner to reduce the occurrence of bottlenecks in the storage system.

[0574] Prior art approaches for storage servers tend to be software intensive. Specifically, a programmable CPU in the server becomes involved in the movement of data between the client and the disks in the disk array. This limits the performance of the storage system because the server CPU becomes a bottleneck. While prior approaches may have a certain degree of hardware acceleration, such as XOR parity operations associated with RAID, these minimal acceleration techniques do not adequately offload the server CPU.

[0575] FIG. 42 shows an architecture for a server, such as the server 130, that reduces loading on a CPU 4205 of the server 130. As shown in FIG. 42, the clients 110 communicate (over the network fabric 120, not shown) with one or more network interfaces 4214. The network interfaces 4214 communicate with a first I/O bus 4201 shown as a network bus. The network bus communicates with the CPU 4205 and with a data engine 4210. A first data cache 4218 and a second data cache 4220 are provided to the data engine 4210. A metadata cache 4216 is provided to the CPU 4205. The CPU 4205 and the data engine 4210 also communicate with a second I/O bus 4202 shown as a storage bus. One or more storage interfaces 4212 also communicate with the second bus 4202.

[0576] The storage interfaces 4212 communicate with the disks 140, 141. In one embodiment, the first I/O bus 4201 is a PCI bus. In one embodiment, the second I/O bus 4202 is a PCI bus. In one embodiment, the caches 4216, 4218, and 4220 are non-volatile. In one embodiment, the network interfaces 4214 are Fibre Channel interfaces. In one embodiment, the storage interfaces 4212 are Fibre Channel interfaces. The data engine 4210 can be a general-purpose processor, a digital signal processor, a Field Programmable Gate Array (FPGA), other forms of soft or hard programmable logic, a custom ASIC, etc. The network interface controllers 4214, 4212 can support Fibre Channel, Ethernet, InfiniBand, or other high performance networking protocols.

[0577] The architecture shown in FIG. 42 allows data to be efficiently moved between the client machines 110 and disks 140-141 with little or no software intervention by the CPU 4205. The architecture shown in FIG. 42 separates the data path from the control message path. The CPU 4205 handles control, file system metadata, and housekeeping functions (conceptually, the CPU 4205 can be considered as a control engine). Actual file data passes through the data engine 4210.

[0578] Control messages (e.g., file read/write commands from clients) are routed to the CPU 4205. The CPU 4205 processes the commands, and queues data transfer operations to the data engine 4210. The data transfer operations, once scheduled with the data engine 4210, can be completed without further involvement of the CPU 4205. Data passing between the disks 140, 141 and the clients 110 (either as read or write operations) is buffered through the data cache 4218 and/or the data cache 4220. In one embodiment, the data engine 4210 operates using a data flow architecture that packages instructions with data as the data flows through the data engine 4210 and its associated data caches.

[0579] The data engine 4210 provides a separate path for data flow by connecting the network interfaces 4214 and the storage interfaces 4212 with the data caches 4218, 4220. The data engine 4210 provides file data transfers between the network interface 4214 and the caches 4218, 4220 and between the storage interface 4212 and the caches 4218, 4220. As an example of the data path operation, consider a client file read operation. A client read request is received on one of the network interfaces 4214 and is routed to the CPU 4205. The CPU 4205 validates the request, and determines from the request which data is desired. The request will typically specify a file to be read, and the particular section of data within the file. The CPU 4205 will use file metadata in the cache 4216 to determine if the data is already present in one of the data caches 4218, 4220, or if the data must be retrieved from the disks 140, 141. If the data is in the data cache 4218, 4220, the CPU 4205 will queue a transfer with the network interfaces 4214 to transfer the data directly from the appropriate data cache 4218, 4220 to the requesting client 110, with no further intervention by the CPU 4205. If the data is not in the data caches 4218, 4220, then the CPU 4205 will queue one or more transfers with the storage interfaces 4212 to move the data from the disks 140, 141 to the data caches 4218, 4220, again without further intervention by the CPU 4205. When the data is in the data caches 4218, 4220, the CPU 4205 will queue a transfer on the network interfaces 4214 to move the data to the requesting client 110, again without further intervention by the CPU 4205.

[0580] One aspect of the operation of the data engine 4210 is that the CPU 4205 schedules data movement operations by writing an entry onto a queue in the network interfaces 4214 or into a queue in the storage interfaces 4212. The data engine 4210 and the network and storage interfaces 4214, 4212 are connected by busses 4201, 4202. The busses 4201, 4202 each include an address bus and a data bus. In one embodiment, the network or storage interfaces 4214, 4212 perform the actual data movement (or sequence of data movements) independently of the CPU 4205 by encoding an instruction code in the address bus that connects the data engine to the interface. The instruction code is set up by the host CPU 4205 when the transfer is queued, and can specify that data is to be written to or read from one or both of the cache memories 4218, 4220. In addition, the instruction code can specify that an operation such as a parity XOR operation or a data conversion operation be performed on the data while it is in transit through the data engine 4210. Because instructions are queued with the data transfers, the host CPU can queue hundreds or thousands of instructions in advance with each interface 4214, 4212, and all of these instructions can be completed asynchronously and autonomously. As described above, once a data movement operation has been queued, the data engine 4210 offloads the CPU 4205 from direct involvement in the actual movement of data from the clients 110 to the disks 140, 141, and vice-versa. The CPU 4205 schedules data transfers by queuing data transfer operations on the network interfaces 4214 and the storage interfaces 4212. The interfaces 4214 and 4212 then communicate directly with the data engine 4210 to perform the data transfer operations. Some data transfer operations involve only the movement of data. Other data transfer operations combine the movement of data with other operations that are to be performed on the data in transit (e.g., parity generation, data recovery, data conversion, etc.). The processing modules in the data engine 4210 can perform five principal operations, as well as a variety of support operations. The principal operations are:

[0581] 1) read from cache

[0582] 2) write to cache

[0583] 3) XOR write to cache

[0584] 4) write to one cache with XOR write to other cache

[0585] 5) write to both caches
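The five principal operations listed above can be modeled in software to show how parity accumulates in one cache while data lands in the other. The following is a toy sketch, not the hardware design: the class `DataEngineModel`, the byte-array caches, and the method names are assumptions introduced for illustration.

```python
class DataEngineModel:
    """Toy model of two data caches and the five principal operations."""

    def __init__(self, size):
        self.caches = [bytearray(size), bytearray(size)]

    def read(self, cache, addr, length):          # 1) read from cache
        return bytes(self.caches[cache][addr:addr + length])

    def write(self, cache, addr, data):           # 2) write to cache
        self.caches[cache][addr:addr + len(data)] = data

    def xor_write(self, cache, addr, data):       # 3) XOR write to cache
        buf = self.caches[cache]
        for i, byte in enumerate(data):
            buf[addr + i] ^= byte

    def write_xor_other(self, cache, addr, data, parity_addr):
        # 4) write to one cache with XOR write to the other cache
        self.write(cache, addr, data)
        self.xor_write(1 - cache, parity_addr, data)

    def write_both(self, addr, data):             # 5) write to both caches
        self.write(0, addr, data)
        self.write(1, addr, data)


engine = DataEngineModel(64)
d0, d1 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"
engine.write_xor_other(0, 0, d0, parity_addr=0)
engine.write_xor_other(0, 4, d1, parity_addr=0)
parity = engine.read(1, 0, 4)  # XOR of d0 and d1, accumulated in cache 1
recovered_d0 = bytes(p ^ b for p, b in zip(parity, d1))
```

XOR-ing the accumulated parity with the surviving block returns the missing block, which is the property that lets a single XOR write path serve both parity generation and data recovery.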

[0586] A typical client file read operation would proceed as follows in the server 130:

[0587] (1) The file read command is received from the client

[0588] (2) The CPU 4205 authenticates client access and access permissions. The CPU 4205 also does metadata lookups to locate the requested data in cache or on disk.

[0589] (3) If data is not in cache, a disk read transaction is queued by sending instructions to the storage interfaces 4212.

[0590] (4) The storage interfaces 4212 move data from disk to the data caches 4218, 4220.

[0591] (5) The CPU 4205 queues a data-send transaction to the network interfaces 4214.

[0592] (6) The network interfaces 4214 send the data to the client, completing the client read operation.
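The read sequence above can be sketched as a control path that only enqueues work and an interface path that performs it. This is a simplified model under stated assumptions: the class `ReadPathModel`, the dictionary-based disk and cache, and the decision to queue the data-send transaction up front are all hypothetical simplifications.

```python
from collections import deque


class ReadPathModel:
    """Hypothetical split between CPU queuing and interface data movement."""

    def __init__(self, disk):
        self.disk = disk              # block id -> bytes, stands in for the disks
        self.data_cache = {}          # stands in for the data caches
        self.storage_queue = deque()  # transfers queued to the storage interfaces
        self.network_queue = deque()  # transfers queued to the network interfaces

    def handle_read(self, client, block, authorized=True):
        """CPU role: validate, look up the block, and queue transfers only."""
        if not authorized:
            raise PermissionError("client may not read this block")
        if block not in self.data_cache:
            self.storage_queue.append(block)        # step (3)
        self.network_queue.append((client, block))  # step (5)

    def run_interfaces(self):
        """Interface role: move data without any further CPU involvement."""
        sent = []
        while self.storage_queue:                   # step (4)
            block = self.storage_queue.popleft()
            self.data_cache[block] = self.disk[block]
        while self.network_queue:                   # step (6)
            client, block = self.network_queue.popleft()
            sent.append((client, self.data_cache[block]))
        return sent


model = ReadPathModel({"b1": b"hello"})
model.handle_read("client-1", "b1")   # cache miss: a disk read is queued
first = model.run_interfaces()
model.handle_read("client-2", "b1")   # cache hit: no disk read is queued
hit_needed_disk = bool(model.storage_queue)
second = model.run_interfaces()
```

The second request is served from the cache alone, which mirrors the case in which the CPU queues only a network transfer.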

[0593] FIG. 43 is a block diagram of the internal structure of an ASIC 4310 that is one example of a hardware embodiment of the data engine 4210. The ASIC 4310 provides the capability for autonomous movement of data between the network interfaces 4214 and data caches 4218, 4220, and between the storage interfaces 4212 and the data caches 4218, 4220. The involvement of the CPU 4205 is often just queuing the desired transfer operations. The ASIC 4310 supports this autonomy by combining an asynchronous data flow architecture, a high-performance data path that can operate independently of the data paths of the CPU 4205, and a data cache memory subsystem. The ASIC 4310 also implements the parity generation functions used to support a RAID-style data protection scheme.

[0594] The data ASIC 4310 is a special-purpose parallel processing system that is data-flow driven. That is, the instructions for the parallel processing elements are embedded in data packets that are fed to the ASIC 4310 and to the various functional blocks within the ASIC 4310.

[0595] In one embodiment, the ASIC 4310 has four principal interfaces: a first data cache interface 4318, a second data cache interface 4320, a first bus interface 4301, and a second bus interface 4302. Other versions of the ASIC 4310 can have a different number of interfaces depending on performance goals.

[0596] Data from the first data cache interface 4318 is provided to a cache read buffer 4330, to a feedback buffer 4338, to a feedback buffer 4340 and to a cache read buffer 4348. Data from the second data cache interface 4320 is provided to a cache read buffer 4331, to a feedback buffer 4339, to a feedback buffer 4341 and to a cache read buffer 4349.

[0597] Data is provided from the bus interface 4301 through a write buffer 4336 to a parity engine 4334. Data is provided from the parity engine 4334 through a cache write buffer 4332 to the cache interface 4318. Data is provided from the feedback buffer 4338 to the parity engine 4334.

[0598] Data is provided from the bus interface 4302 through a write buffer 4346 to a parity engine 4344.

[0599] Data is provided from the parity engine 4344 through a cache write buffer 4342 to the cache interface 4318. Data is provided from the feedback buffer 4340 to the parity engine 4344.

[0600] Data is provided from the bus interface 4301 through a write buffer 4337 to a parity engine 4335. Data is provided from the parity engine 4335 through a cache write buffer 4333 to the cache interface 4320. Data is provided from the feedback buffer 4339 to the parity engine 4335.

[0601] Data is provided from the bus interface 4302 through a write buffer 4347 to a parity engine 4345. Data is provided from the parity engine 4345 through a cache write buffer 4343 to the cache interface 4320. Data is provided from the feedback buffer 4341 to the parity engine 4345.

[0602] Data is provided from the cache read buffers 4348, 4349 to the bus interface 4302. Data is provided from the cache read buffers 4330, 4331 to the bus interface 4301.

[0603] Data transfer paths are provided between the cache interface 4318 and the bus interfaces 4301 and 4302. Similarly, data transfer paths are provided between the cache interface 4320 and the bus interfaces 4301 and 4302. A control logic 4380 includes, in each of these data paths, a processing engine that controls data movement between the respective interfaces as well as operations that can be performed on the data as it moves between the interfaces. The control logic 4380 is data-flow driven as described above.

[0604] In one embodiment, the bus 4201 is a PCI bus, the bus 4202 is a PCI bus, and data-transfer commands for the data engine are contained in PCI addresses on the respective buses. FIG. 44 is a map 4400 of data fields in a 64-bit data transfer instruction to the data engine for use with a 64-bit PCI bus. A cache address is coded in bits 0-31. A parity index is coded in bits 35-50. An opcode is coded in bits 56-58. A block size is coded in bits 59-61. A PCI device address is coded in bits 62-63. Bits 32-34 and 51-55 are unused.
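The field layout of the 64-bit instruction word can be illustrated by packing and unpacking the fields. The sketch below assumes that bit 0 is the least significant bit, which the text does not state; the function names and the example opcode value are hypothetical, since the opcode numbering is not given here.

```python
# Bit ranges (low, high), inclusive, as listed for the map 4400.
FIELDS = {
    "cache_address": (0, 31),
    "parity_index": (35, 50),
    "opcode": (56, 58),
    "block_size": (59, 61),
    "pci_device": (62, 63),
}


def encode(**values):
    """Pack named field values into one 64-bit word; unused bits stay zero."""
    word = 0
    for name, value in values.items():
        low, high = FIELDS[name]
        width = high - low + 1
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << low
    return word


def decode(word):
    """Extract every named field from a 64-bit word."""
    return {
        name: (word >> low) & ((1 << (high - low + 1)) - 1)
        for name, (low, high) in FIELDS.items()
    }


# The opcode value 3 is an arbitrary placeholder, not a documented encoding.
word = encode(cache_address=0x1000, parity_index=5, opcode=3,
              block_size=2, pci_device=1)
fields = decode(word)
```

Because the command travels in the address, an interface that simply presents the address on the bus delivers the command to the data engine without interpreting it.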

[0605] The block size is used to select the extent of a block addressed by the parity index. This is the number of consecutive 16 kilobyte blocks that make up the parity block addressed by the parity index. In one embodiment, the block size is three bits, interpreted as follows:

block size = 0: parity block = 16 k
block size = 1: parity block = 32 k
block size = 2: parity block = 64 k
block size = 3: parity block = 128 k
block size = 4: parity block = 256 k
block size = 5: parity block = 512 k
block size = 6: parity block = 1024 k
block size = 7: parity block = 2048 k
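The listed values double with each increment of the block size field, so the mapping can be written as a shift. The helper name below is hypothetical, and the sketch assumes that "k" denotes kilobytes.

```python
def parity_block_kilobytes(block_size):
    """Return the parity block extent in kilobytes for a 3-bit block size."""
    if not 0 <= block_size <= 7:
        raise ValueError("block size is a three-bit field (0-7)")
    return 16 << block_size  # 16 k doubled block_size times


extents = [parity_block_kilobytes(n) for n in range(8)]
```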

[0606] In one embodiment, the bus interface 4301 is a PCI interface and the bus interface 4302 is a PCI interface. Each of these PCI interfaces includes a read control to control reads from the caches 4218 and 4220. The read control reads data from the respective output buffers 4330, 4331, 4348, and 4349 as needed. On completion of a PCI transaction, the output buffer is cleared. Each PCI interface also includes a write control to control writes to the input buffers. The write control adds an address word to the start of a data stream and control bits to each word written to the input buffer. In the case where parity is generated and data is saved, the write control: determines which cache 4218, 4220 gets the data; assigns parity to the other cache (that is, the cache that does not receive the data); and adds control bits to the data stream. Address words are typically identical for the various input buffers, but added control bits will be different for each input buffer. For parity generation, or regeneration of lost data, the data in transit is stored in one of the feedback buffers 4338, 4339, 4340, or 4341. The feedback buffer is cleared on completion of a data stream operation.

[0607] As described above, each data block written to an input buffer has address and control bits inserted into the data stream. The control bits are as follows:

[0608] bit 0: identifies a word as an address/control word or a data word

[0609] bit 1: set to tag last word in a data stream

[0610] bit 2: enable/disable XOR (enable/disable parity operations)

[0611] bit 3: for an address word, specifies the type of addressing as either index addressing (for parity and regeneration data) or direct addressing (for normal data)
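The four control bits listed above can be decoded with a small flag type. The polarity of each bit is an assumption made for illustration (a set bit 0 is taken to mean an address/control word, and a set bit 3 is taken to mean index addressing); the text does not specify which value selects which meaning, and the names `Control` and `describe` are hypothetical.

```python
from enum import IntFlag


class Control(IntFlag):
    ADDRESS_WORD = 1 << 0      # bit 0: address/control word (assumed polarity)
    LAST_WORD = 1 << 1         # bit 1: tags the last word in a data stream
    XOR_ENABLE = 1 << 2        # bit 2: enables parity (XOR) operations
    INDEX_ADDRESSING = 1 << 3  # bit 3: index vs. direct (assumed polarity)


def describe(bits):
    """Interpret a 4-bit control field as a readable dictionary."""
    flags = Control(bits & 0xF)
    is_address = Control.ADDRESS_WORD in flags
    addressing = None  # bit 3 is only meaningful for address words
    if is_address:
        addressing = "index" if Control.INDEX_ADDRESSING in flags else "direct"
    return {
        "kind": "address" if is_address else "data",
        "last": Control.LAST_WORD in flags,
        "xor": Control.XOR_ENABLE in flags,
        "addressing": addressing,
    }


parity_address_word = describe(0b1101)  # bits 0, 2, and 3 set
final_data_word = describe(0b0010)      # bit 1 set
```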

[0612] For operations that include an XOR operation, the XOR destination is a “parity block” in cache (e.g., in the cache 4218 or the cache 4220). When a parity block is addressed, the address is calculated from a combination of: the parity index field from the PCI address word; the lower bits of the PCI address bus (the number depending on the block size); and the block size field from the PCI address word. Once the ASIC 4310 calculates the parity block address for the first PCI data word, this address is incremented for each subsequent data word.

[0613] The parity block address can be generated from the PCI address word using one of two methods. The first method is to concatenate the parity index with the lower bits of the PCI address word. The second method is to sum the parity index with the lower bits of the PCI address word. In either method, data is typically aligned to a natural boundary (e.g., 16 k blocks to a 16 k boundary, 32 k blocks to a 32 k boundary, etc.).
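The two address-generation methods can be compared side by side. The sketch rests on several assumptions that the text leaves open: byte addressing, a 14-bit offset for the smallest (16 k) block that grows by one bit per block size step, and, for the summing method, a parity index that counts 16 k units. All function names are hypothetical.

```python
BASE_BLOCK_BITS = 14  # 16 k = 2**14 bytes, the smallest parity block


def offset_bits(block_size):
    """Number of low address bits that select a byte within the parity block."""
    if not 0 <= block_size <= 7:
        raise ValueError("block size is a three-bit field (0-7)")
    return BASE_BLOCK_BITS + block_size


def concat_address(parity_index, pci_address, block_size):
    """Method 1: the index forms the upper bits, the low address bits the offset."""
    bits = offset_bits(block_size)
    return (parity_index << bits) | (pci_address & ((1 << bits) - 1))


def sum_address(parity_index, pci_address, block_size):
    """Method 2: add the offset to a base derived from the index.

    Assumption: the index counts 16 k units, so the base is index * 16 k.
    """
    bits = offset_bits(block_size)
    base = parity_index << BASE_BLOCK_BITS
    return base + (pci_address & ((1 << bits) - 1))


small_concat = concat_address(3, 0x12345, block_size=0)
small_sum = sum_address(3, 0x12345, block_size=0)
large_concat = concat_address(3, 0x12345, block_size=1)
```

Under these assumptions the two methods agree for the smallest block size and diverge for larger ones, because concatenation scales the index by the block extent while summation does not. Natural alignment keeps the offset from carrying into the index bits in either case.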

[0614] The CPU 4205 queues network transaction requests to the network interfaces 4214 and storage transaction requests to the storage interfaces 4212. In one embodiment, the network bus 4201 is a memory-mapped bus having an address word and one or more data words (such as, for example, a PCI bus) and queuing a network transaction request involves sending an address word and one or more data words to a selected network interface 4214. In one embodiment, the address word includes opcode bits and address bits as shown in FIG. 44. The data words provide information to the selected network interface 4214 regarding what to do with the data at the specified address (e.g., where to send the data and to notify the CPU 4205 when the data has been sent). In one embodiment, the selected network interface 4214 views the data engine 4210 (e.g., the ASIC 4310) as simply a memory to use for retrieving and storing data using addresses in the address word included in the network transaction request. In such an embodiment, the network interface 4214 does not know that the data engine 4210 is interpreting various bits of the address word as opcode bits and that the data engine 4210 is performing operations (e.g., parity operations) on the data.

[0615] The storage interfaces 4212 operate with the data engine 4210 (e.g., the ASIC 4310) in a similar manner. The storage interfaces 4212 view the data engine 4210 as a memory (e.g., a simple cache). The storage interfaces 4212 communicate with the disks 140, 141 to retrieve data from the disks and write data to the disks. The data engine 4210 takes care of assembling parity groups, computing parity, recovering lost data, etc. “Hiding” the parity calculations in the data engine 4210 offloads the parity workload from the CPU 4205, thereby giving the CPU 4205 more time for metadata operations. Moreover, using a portion of the memory-mapped bus address word allows the CPU 4205 to send commands to the data engine 4210, again offloading data operations from the CPU 4205. The commands are associated with the data (by virtue of being associated with the address of the data). The network interfaces 4214 and the storage interfaces 4212 (which themselves are typically network-type interfaces such as Fibre Channel interfaces, SCSI interfaces, InfiniBand interfaces, etc.) are unaware of the opcode information buried in the address words. This allows standard “off-the-shelf” interfaces to be used.

[0616] In one embodiment, the CPU 4205 keeps track of the data stored in the data caches 4218 and 4220, thus allowing the server 130 to service many client requests for file data directly from the caches 4218 and 4220 to the network interfaces 4214, without the overhead of disk operations.

[0617] Although the foregoing description of the invention has shown, described and pointed out novel features of the invention, it will be understood that various omissions, substitutions, and changes in the form of the detail of the apparatus as illustrated, as well as the uses thereof, may be made by those skilled in the art without departing from the spirit of the present invention. Consequently the scope of the invention should not be limited to the foregoing discussion but should be defined by the appended claims.

What is claimed is:
 1. A computer storage system, comprising: a plurality of disk drives for storing distributed parity groups, each distributed parity group comprising storage blocks, said storage blocks comprising one or more data blocks and a parity block associated with said one or more data blocks, each of said storage blocks stored on a separate disk drive such that no two storage blocks from a given parity group reside on the same disk drive; file system metadata to describe a location of each of said storage blocks; a resource-allocation module to recognize a new disk drive hot-swapped into said plurality of disk drives during file system operation and to use said new disk drive to store one or more storage blocks.
 2. The computer storage system of claim 1, wherein a size of a first distributed parity group is larger than a size of a second distributed parity group within a first file.
 3. The computer storage system of claim 1, further comprising metadata to specify which disk drive in said plurality of disk drives and said new disk drive contains each storage block.
 4. The computer storage system of claim 1, wherein said new disk drive is provided to a Fibre Channel network.
 5. The computer storage system of claim 1, wherein a file is organized as one or more of said distributed parity groups.
 6. The computer storage system of claim 1, wherein said file system metadata comprises information to specify a logical block address for each storage block in a distributed parity group.
 7. The computer storage file system of claim 1, wherein an extent of a first distributed parity group of a file is larger than an extent of a second distributed parity group of said file.
 8. The computer storage file system of claim 23, further comprising a load-balancing module to distribute one or more existing storage blocks to said new disk drive.
 9. A method for hot-swapping a new storage device in a storage system, comprising: recognizing said new storage device; adding said new disk drive to a list of previously-available storage devices to produce a list of currently-available storage devices; determining a size of a new parity group, said size describing a number of data blocks in said new parity group; computing a parity block for said parity group; and storing one of said data blocks or said parity block on said new storage device.
 10. The method of claim 9, further comprising storing metadata to describe a disk and logical block location of each of said data blocks and said parity block.
 11. The method of claim 9, further comprising combining a first parity group having a first size and a second parity group having a second size to produce a combined parity group having a third size, wherein said third size specifies a number of data blocks that is one less than the number of currently-available storage devices.
 12. The method of claim 9, wherein said new storage device comprises a disk drive.
 13. The method of claim 9, further comprising: recognizing that a selected storage device has gone offline, removing said selected storage device from said list of currently-available storage devices to produce a list of remaining storage devices; reconstructing data stored on said selected storage device; storing said reconstructed data on one or more of said remaining storage devices; and updating file system metadata to facilitate locating said reconstructed data.